<a href="https://colab.research.google.com/github/doronschwartz/NLP/blob/main/HW3/sentiment_hw.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Homework #3 is to study Sentiment Analysis with five types of models:

1.	Rule-based
2.	Bag of Words
3.	Shallow embedding with CNN
4.  LSTM
5.	Transformer Models

I got this idea from my student Itay Etelis and his Huggingface depo is at:
https://huggingface.co/pig4431

We will discuss all five possibilities in class.

## YELP Reviews DataSet
The **YELP** reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

The Yelp reviews full star dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the Yelp Dataset Challenge 2015. It is first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).  Reviews with 1, 2 stars has been marked negative.
Reviews with 4, 5 stars has been marked positive.

Reviews with 3 stars has been filtered.

Full information about this dataset is at:
https://www.kaggle.com/datasets/ilhamfp31/yelp-review-dataset

You can either download this dataset from there, or use Itay's code with a train/test/validation partition:

In [5]:
!pip install datasets -q

import datasets
datasets.logging.set_verbosity_error()
datasets.disable_progress_bar()

from datasets import load_dataset
yelp = load_dataset("pig4431/yelp_train25k_test5k_valid5k")
yelp_train = yelp['train']
yelp_validate = yelp['validate']
yelp_test = yelp['test']

In [3]:
yelp_train[0]

{'label': -1,
 'text': "Overall - do not ever stay here if you can avoid it.  I will be posting this review to Hotels.com as well to try and help other travelers.  Read on for details.\\n\\nI got put here by US Airways after my flight was cancelled due to 'aircraft maintenance' and after the attendant apologized for having no other choice but this Microtel, I thought 'How bad can it be?'  Well let me tell you.  As a pretty well traveled guy who takes a pride in being comfortable anywhere, no matter how run-down or low-maintenance, and has thoroughly enjoyed sleeping in dirt-floor mosquito-net bungalows in Laos - the only way to possibly be comfortable here is to down a bucket of beer and pass out cold.\\n\\nYou want to forget you are staying here.  BEFORE you stay here.\\n\\nYou want to check your bed for traces of bed bugs.\\n\\nYou stare in awe as two police officers remove an obviously cracked out prostitute from the ice-machine area.\\n\\nEven through your alcoholic haze you can te

The advantage to having a testing dataset is that you can tune certain parameters and then see if they validate correctly.  This can be done for all 5 models, but is particularly easy for the first two models: Rule-based and Bag-of-Words models.

For example, you can use the rule based approach, VADER:

https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
by installing:


In [4]:
!pip install vaderSentiment


Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/126.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m122.9/126.0 kB[0m [31m4.4 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.0/126.0 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [5]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sid_obj = SentimentIntensityAnalyzer()
sentiment_dict = sid_obj.polarity_scores("I love programming!")

In [6]:
print(sentiment_dict)

{'neg': 0.0, 'neu': 0.308, 'pos': 0.692, 'compound': 0.6696}


Note that this model yields four different scores: neg, neu, pos, and compound. We'll talk about these four possibilities in class.

Note that the compound score is meant to be based of *all* lexicon ratings and is normalized between -1 (most extreme negative) and +1 (most extreme positive).  In the link above they suggest the thresholds for this value:

positive sentiment : (compound score >= 0.05)
neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
negative sentiment : (compound score <= -0.05)

However, please check if this is check the actually best threshold for the given dataset but checking a range of values such as:

for compound_score in np.arange(-1, 1, 0.1):

which can be checked in test dataset and then validated.

Please do so!

Please implement the rule-based model here with the hyperparameter
tuning for the compound_score.

In [7]:
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score
import nltk
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [8]:

# Define a function to calculate the compound score
def calculate_compound_score(text):
    sid = SentimentIntensityAnalyzer()
    compound_score = sid.polarity_scores(text)['compound']
    return compound_score

# Implement the rule-based model with hyperparameter tuning
best_threshold = None
best_accuracy = 0

num_thresholds = len(np.arange(-1, 1, 0.5))
current_threshold = 0

for compound_threshold in np.arange(-1, 1, 0.5):
    val_predictions = []
    for example in yelp_validate:
        text = example['text']
        compound_score = calculate_compound_score(text)
        sentiment = 0 if abs(compound_score) < abs(compound_threshold) else (1 if compound_score > 0 else -1)
        val_predictions.append(sentiment)

    true_labels = [example['label'] for example in yelp_validate]
    accuracy = accuracy_score(true_labels, val_predictions)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_threshold = compound_threshold

    current_threshold += 1
    print(f"Progress: {current_threshold}/{num_thresholds}")

print("Best compound score threshold:", best_threshold)

# Evaluate the model's performance on the test set
test_predictions = []
for example in yelp_test:
    text = example['text']
    compound_score = calculate_compound_score(text)
    sentiment = 0 if abs(compound_score) < abs(best_threshold) else (1 if compound_score > 0 else -1)
    test_predictions.append(sentiment)

true_labels = [example['label'] for example in yelp_test]
test_accuracy = accuracy_score(true_labels, test_predictions)

print("Test accuracy:", test_accuracy)




Progress: 1/4
Progress: 2/4
Progress: 3/4
Progress: 4/4
Best compound score threshold: 0.0
Test accuracy: 0.7042


Now, please implement a Bag of Words model.  Check if feature selection works and validate on the validation dataset.

Which words are most strongly correlated to positive sentiment?  Which are strongly correlated to negative sentiment?  One way to check is find those words with high PMI to positive words (e.g. excellent, great) and those with negative words (bad, terrible).

Please work similarly to what we did in the first homework and feel free to adapt your PMI corde from there.

Also, similar to the first homework, using the train and test datasets, find the number of features to choose.  Then validate this amount using the validation dataset.

In [10]:
import numpy as np
from collections import Counter
from scipy.sparse import lil_matrix
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize

# Tokenize the text data
def tokenize(text):
    return word_tokenize(text.lower())

# Count the occurrences of each word in the corpus
def count_words(corpus):
    word_count = Counter()
    for text in corpus:
        tokens = tokenize(text)
        word_count.update(tokens)
    return word_count


In [11]:
# Implement feature selection to determine the most relevant words
def select_features(corpus, max_features=None):
    word_count = count_words(corpus)
    if max_features:
        selected_features = [word for word, _ in word_count.most_common(max_features)]
    else:
        selected_features = [word for word, _ in word_count.items()]
    return selected_features

In [12]:
# Convert text data to Bag of Words representation
def text_to_bow(corpus, selected_features):
    vectorizer = CountVectorizer(vocabulary=selected_features, tokenizer=tokenize)
    X = vectorizer.fit_transform(corpus)
    return X


In [37]:
# Calculate Pointwise Mutual Information (PMI)
def pmi(pos_counts, neg_counts, word_counts, total_docs):
    pmi_scores = {}
    for word, count in word_counts.items():
        # Calculate PMI only for words occurring at least 5 times
        if count < 5:
            continue
        pos_prob = pos_counts.get(word, 0) / total_docs
        neg_prob = neg_counts.get(word, 0) / total_docs
        print(f"Word: {word}, Count: {count}, Pos_prob: {pos_prob}, Neg_prob: {neg_prob}")
        if count != 0 and pos_prob != 0 and neg_prob != 0:  # Add additional checks
            pmi_score = np.log2((pos_prob * total_docs) / (count * (pos_prob + neg_prob)))
            pmi_scores[word] = pmi_score
    return pmi_scores


In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# Train the model and calculate accuracy
def train_model(X_train, y_train, X_val, y_val):
    # Train a logistic regression classifier
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Predict on validation set
    y_pred = clf.predict(X_val)

    # Calculate accuracy
    accuracy = accuracy_score(y_val, y_pred)
    return accuracy

In [4]:
# Load the Yelp dataset
import datasets
datasets.logging.set_verbosity_error()
datasets.disable_progress_bar()

from datasets import load_dataset
yelp = load_dataset("pig4431/yelp_train25k_test5k_valid5k")
yelp_train = yelp['train']
yelp_validate = yelp['validate']
yelp_test = yelp['test']

# Extract reviews and labels
train_reviews = [d['text'] for d in yelp_train]
train_labels = [d['label'] for d in yelp_train]
val_reviews = [d['text'] for d in yelp_validate]
val_labels = [d['label'] for d in yelp_validate]
test_reviews = [d['text'] for d in yelp_test]
test_labels = [d['label'] for d in yelp_test]


In [32]:
import nltk
nltk.download('punkt')

# Determine the number of features to choose
max_features_list = [100, 500, 1000, 5000, 10000]
best_accuracy = 0
best_num_features = None

for max_features in max_features_list:
    # Feature selection
    selected_features = select_features(train_reviews, max_features)

    # Convert text data to Bag of Words representation
    X_train = text_to_bow(train_reviews, selected_features)
    X_val = text_to_bow(val_reviews, selected_features)

    # Train the model and calculate accuracy
    accuracy = train_model(X_train, train_labels, X_val, val_labels)

    # Update best number of features if accuracy improves
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_num_features = max_features

print("Best number of features:", best_num_features)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://sciki

Best number of features: 5000


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [33]:
# Validate the chosen number of features using the validation dataset
selected_features = select_features(train_reviews, best_num_features)
X_train = text_to_bow(train_reviews, selected_features)
X_val = text_to_bow(val_reviews, selected_features)
accuracy = train_model(X_train, train_labels, X_val, val_labels)
print("Accuracy on validation set with best number of features:", accuracy)




Accuracy on validation set with best number of features: 0.913


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [49]:
print("Accuracy on validation set with best number of features:", accuracy)

Accuracy on validation set with best number of features: 0.913


In [51]:
print(best_num_features)

5000


In [50]:
# Analyze the correlation of words with positive and negative sentiment using PMI
pos_reviews = [review for review, label in zip(train_reviews, train_labels) if label == 1]
neg_reviews = [review for review, label in zip(train_reviews, train_labels) if label == 0]

pos_word_counts = count_words(pos_reviews)
neg_word_counts = count_words(neg_reviews)
total_docs = len(train_reviews)
word_counts = count_words(train_reviews)

pmi_pos = pmi(pos_word_counts, neg_word_counts, word_counts, total_docs)
pmi_neg = pmi(neg_word_counts, pos_word_counts, word_counts, total_docs)

# Sort words by PMI scores
sorted_pmi_pos = sorted(pmi_pos.items(), key=lambda x: x[1], reverse=True)
#sorted_pmi_neg = sorted(pmi_neg.items(), key=lambda x: x[1], reverse=True)


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Word: banging, Count: 14, Pos_prob: 0.0, Neg_prob: 0.00012
Word: dinky, Count: 9, Pos_prob: 0.0, Neg_prob: 8e-05
Word: shreds, Count: 9, Pos_prob: 0.0, Neg_prob: 4e-05
Word: wedges, Count: 27, Pos_prob: 0.0, Neg_prob: 0.00064
Word: proportion, Count: 6, Pos_prob: 0.0, Neg_prob: 8e-05
Word: 41, Count: 8, Pos_prob: 0.0, Neg_prob: 8e-05
Word: \nnot, Count: 27, Pos_prob: 0.0, Neg_prob: 0.00028
Word: natives, Count: 5, Pos_prob: 0.0, Neg_prob: 8e-05
Word: colour, Count: 14, Pos_prob: 0.0, Neg_prob: 0.00028
Word: observing, Count: 10, Pos_prob: 0.0, Neg_prob: 0.00016
Word: intent, Count: 14, Pos_prob: 0.0, Neg_prob: 0.00012
Word: threaten, Count: 8, Pos_prob: 0.0, Neg_prob: 0.0
Word: \n\nunfortunately, Count: 26, Pos_prob: 0.0, Neg_prob: 0.0002
Word: begrudgingly, Count: 10, Pos_prob: 0.0, Neg_prob: 0.00012
Word: shadow, Count: 13, Pos_prob: 0.0, Neg_prob: 0.00024
Word: \n\nwhere, Count: 9, Pos_prob: 0.0, Neg_prob: 0.00012
Word

In [45]:

print("Top words correlated with positive sentiment:")
for word, pmi_score in sorted_pmi_pos[:10]:
    print(f"{word}: PMI = {pmi_score}")

print("\nTop words correlated with negative sentiment:")
for word, pmi_score in sorted_pmi_neg[:10]:
    print(f"{word}: PMI = {pmi_score}")


Top words correlated with positive sentiment:

Top words correlated with negative sentiment:


Now, please implement an embedded model with a CNN similar to the previous homework. Feel free to use either Glove *or* Word2Vec -- whichever worked better for you in the previous homework. No need to check both.

In [2]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score
import gensim.downloader as api

# Download and load GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-300")




In [3]:
import re
# Preprocess text data to remove contractions
def preprocess_text(text):
    # Remove contractions
    text = re.sub(r"'\w+", '', text)
    return text
# Tokenize text data
def tokenize_text(texts, tokenizer, max_length):
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
    return padded_sequences

# Define parameters
max_length = 100  # Maximum sequence length
num_classes = 1  # Number of classes (positive or negative sentiment)

In [2]:
# Extract reviews and labels
train_reviews = [d['text'] for d in yelp_train]
train_labels = [0 if d['label'] == -1 else 1 for d in yelp_train]  # Change labels to 0 and 1
val_reviews = [d['text'] for d in yelp_validate]
val_labels = [0 if d['label'] == -1 else 1 for d in yelp_validate]  # Change labels to 0 and 1
test_reviews = [d['text'] for d in yelp_test]
test_labels = [0 if d['label'] == -1 else 1 for d in yelp_test]

In [5]:
# Prepare data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_reviews)
train_reviews_preprocessed = [preprocess_text(text) for text in train_reviews]
val_reviews_preprocessed = [preprocess_text(text) for text in val_reviews]
X_train = tokenize_text(train_reviews_preprocessed, tokenizer, max_length)
X_val = tokenize_text(val_reviews_preprocessed, tokenizer, max_length)

# Create embedding matrix
word_index = tokenizer.word_index
embedding_dim = len(glove_model['word'])  # Dimension of word embeddings
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))  # Initialize embedding matrix
for word, i in word_index.items():
    if word in glove_model:
        embedding_matrix[i] = glove_model[word]



In [6]:
# Build CNN model
def build_cnn_model(embedding_matrix, max_length, num_classes):
    model = Sequential()
    model.add(Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                        weights=[embedding_matrix], input_length=max_length, trainable=False))
    model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [7]:
# Convert labels to numpy arrays
train_labels_array = np.array(train_labels)
val_labels_array = np.array(val_labels)

# Train CNN model
model = build_cnn_model(embedding_matrix, max_length, num_classes)
model.fit(X_train, train_labels_array, validation_data=(X_val, val_labels_array), epochs=10, batch_size=128,
          callbacks=[EarlyStopping(patience=3, restore_best_weights=True)])


# Predict on validation set
y_pred = model.predict(X_val)
y_pred_binary = (y_pred > 0.5).astype(int)

# Calculate accuracy
accuracy = accuracy_score(val_labels, y_pred_binary)
print("Validation Accuracy:", accuracy)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Validation Accuracy: 0.873


Next, please try a LSTM model with Keras' word embedding.
I personally liked the tutorial here:

https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/.  Note that certain lines like:

top_words = 5000

(X_train, y_train), (X_test, y_test) = imdb.load_data

(num_words=top_words)

will need to be tweaked. Feel free to use the number of top_words you had in the Bag-of_words model for the parameter top_words.

In [27]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.preprocessing import sequence
from sklearn.metrics import accuracy_score
from datasets import load_dataset

# Load the Yelp dataset
yelp = load_dataset("pig4431/yelp_train25k_test5k_valid5k")
yelp_train = yelp['train']
yelp_test = yelp['test']
yelp_valid = yelp['validate']

# Preprocess the reviews and labels
max_review_length = 100  # Maximum review length
X_train = [review['text'] for review in yelp_train]
X_test = [review['text'] for review in yelp_test]
X_valid = [review['text'] for review in yelp_valid]
y_train = [int(review['label']) if review['label'] != -1 else 0 for review in yelp_train]  # Convert -1 labels to 0
y_test = [int(review['label']) if review['label'] != -1 else 0 for review in yelp_test]  # Convert -1 labels to 0
y_valid = [int(review['label']) if review['label'] != -1 else 0 for review in yelp_valid]  # Convert -1 labels to 0
top_words = 5000

top_words = 5000

# Convert text to sequences and pad sequences



In [28]:
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
X_valid = tokenizer.texts_to_sequences(X_valid)

X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
X_valid = sequence.pad_sequences(X_valid, maxlen=max_review_length)


In [29]:
# Build the LSTM model
embedding_vector_length = 32  # Dimension of word embeddings
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

X_valid = np.array(X_valid)
y_valid = np.array(y_valid)
y_train = np.array(y_train)


# Train the model
model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=3, batch_size=64)




Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 100, 32)           160000    
                                                                 
 lstm_7 (LSTM)               (None, 100)               53200     
                                                                 
 dense_9 (Dense)             (None, 1)                 101       
                                                                 
Total params: 213301 (833.21 KB)
Trainable params: 213301 (833.21 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x7f60dc70ee60>

In [30]:
# Evaluate the model on the test set
y_test = np.array(y_test)
X_test = np.array(X_test)
y_pred_probs = model.predict(X_test)
y_pred = (y_pred_probs > 0.5).astype(int)  # Convert probabilities to binary labels

# Calculate test accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.902


Now try a transformer model.  While you can train it from scratch, I suggest you don't and use something what we discussed in class:
https://colab.research.google.com/drive/15fisDt6RHTdFnkskokD9-jJ9luEbv-z3?usp=sharing

While this dataset was developed for SST2 (Stanford Sentiment Treebank v2), feel free to use it "as is" and without any fine tuning to the model. However, do please check if a different sentiment threshold would work better for this specific datset similar to what you did in the Vader model. It may be that the threshold will need to be tuned here too.

In [6]:
!pip install transformers



In [7]:
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

for j in range(len(val_reviews)):
    tokenized_segments = tokenizer(val_reviews[j], return_tensors="pt", padding=True, truncation=True)
    tokenized_segments_input_ids, tokenized_segments_attention_mask = tokenized_segments.input_ids, tokenized_segments.attention_mask
    model_predictions = F.softmax(model(input_ids=tokenized_segments_input_ids, attention_mask=tokenized_segments_attention_mask)['logits'], dim=1)

    # Print the model predictions
    print(f"Review: {val_reviews[j]}")
    print("Prediction:", model_predictions.tolist())
    print("-" * 50)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Prediction: [[0.00011977124813711271, 0.9998801946640015]]
--------------------------------------------------
Review: Can't speak for the rooms, but the Casino remodel is very nice and provides a much needed improvement.  I enjoyed gambling there for a couple of hours.
Prediction: [[0.0002279087493661791, 0.9997721314430237]]
--------------------------------------------------
Review: Absolutely horrendous service, pizza that is basically worse than frozen pizza. The waitress was 100 and we waited an hour and a half for them to forget our order. Ridiculous prices for the service and the terrible food!
Prediction: [[0.9995939135551453, 0.00040609337156638503]]
--------------------------------------------------
Review: Like a glutton for punishment, I hit up Roland's again last night while in town for business. I tired Kaya, but the wait was 30 + minutes.\n\nHere's the big clue I keep ignoring....the restaurant is EMPTY when

Make sure to reflect about these models and the differences in their performance.

We can see the baseline model performs the worst, BOW performs very well with a simpler appraoch, while the CNN and the LSTM perform very similary, while also performing well.

The Transformer, with the being trained already, can make the best predicitons of them all.