# Sentiment Prediction

In this last part of the hackathon, we will train a small deep learning model on the IMDB Sentiment dataset.
This dataset contains review texts, each labelled with a "positive" or "negative" sentiment.
You will implement the preprocessing of the data yourself using Polars.
We will then train and evaluate a neural network that can predict a sentiment for an arbitrary input text!

In [9]:
import joblib
import polars as pl
from sklearn.model_selection import train_test_split

## Load the data
First of all we will load the data.
It is always a good idea to visualise a portion of the data and do a few sanity checks before starting the preprocessing.

### Exercise 5.1
#### Exercise 5.1.1
Load the data using polars.
Since in this case we are going to process all the data at once, we can read it in directly rather than scanning and collecting separate parts in our analysis.
However, a golden tip: it is still a good idea to convert the DataFrame to a LazyFrame, so that the optimization engine can optimize all the queries we define subsequently.

In [10]:

data = pl.read_csv("content/data/imdb_sentiment_dataset/IMDB_Dataset.csv")
data = data.lazy()

#### Exercise 5.1.2
Using only Polars, show the number of records in the dataset, (separately) visualise the first few records and the last few records of the data and show the counts per sentiment value.

In [11]:
display(data.select(pl.len()).collect())
display(data.head().collect())
display(data.tail().collect())
display(data.collect()["sentiment"].value_counts())

| len |
| --- |
| u32 |
| 50000 |

| review | sentiment |
| --- | --- |
| str | str |
| "One of the other reviewers has…" | "positive" |
| "A wonderful little production.…" | "positive" |
| "I thought this was a wonderful…" | "positive" |
| "Basically there's a family whe…" | "negative" |
| "Petter Mattei's "Love in the T…" | "positive" |

| review | sentiment |
| --- | --- |
| str | str |
| "I thought this movie did a dow…" | "positive" |
| "Bad plot, bad dialogue, bad ac…" | "negative" |
| "I am a Catholic taught in paro…" | "negative" |
| "I'm going to have to disagree …" | "negative" |
| "No one expects the Star Trek m…" | "negative" |

| sentiment | count |
| --- | --- |
| str | u32 |
| "negative" | 25000 |
| "positive" | 25000 |


## Preprocessing
In case of a larger dataset, it may be a good idea to use the `tf.data.Dataset` interface, whose `map()` function allows portions of the data to be preprocessed in parallel with training on the data that has already been preprocessed.
In this case, our dataset is relatively small and we can do the full preprocessing beforehand.

First off, for the training procedure we will want to use integer values for our labels.
In the case of binary values, mapping to 0 and 1 is most commonly used.

### Exercise 5.2
Convert the sentiment column in the data frame by mapping positive reviews to the value 1 and negative reviews to the value 0.
Display the first few records to ensure that the conversion was done correctly, then display the value counts to ensure that the column now only consists of 0 and 1 values and that the counts are identical to before.

In [12]:
data = data.with_columns(
    pl.col("sentiment").replace({"positive": 1, "negative": 0}).cast(pl.Int8)
)
display(data.head().collect())
display(data.tail().collect())
display(data.collect()["sentiment"].value_counts())

| review | sentiment |
| --- | --- |
| str | i8 |
| "One of the other reviewers has…" | 1 |
| "A wonderful little production.…" | 1 |
| "I thought this was a wonderful…" | 1 |
| "Basically there's a family whe…" | 0 |
| "Petter Mattei's "Love in the T…" | 1 |

| review | sentiment |
| --- | --- |
| str | i8 |
| "I thought this movie did a dow…" | 1 |
| "Bad plot, bad dialogue, bad ac…" | 0 |
| "I am a Catholic taught in paro…" | 0 |
| "I'm going to have to disagree …" | 0 |
| "No one expects the Star Trek m…" | 0 |

| sentiment | count |
| --- | --- |
| i8 | u32 |
| 1 | 25000 |
| 0 | 25000 |


### Exercise 5.3
Split the data into two distinct sets of train and test data respectively.
Visualise the output shapes to ensure that the ratio between the two is as expected.


In [13]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [14]:
train_data, test_data = train_test_split(data.collect(), test_size=0.2, random_state=42)
print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)
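To make the idea behind the split concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split. It is not how scikit-learn implements `train_test_split`; the helper name `shuffled_split` and the toy `rows` list are hypothetical and only serve as an illustration.

```python
import random

def shuffled_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly, then cut them into a train and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split on every run
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Toy stand-in for (review, sentiment) records
rows = [(f"review {i}", i % 2) for i in range(50)]
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))  # 40 10
```

Every record ends up in exactly one of the two sets, which is what keeps the test data unseen during training.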


## Tokenize the review texts
An important concept in natural language processing is tokenization.
As individual characters carry little meaning on their own and would yield very long input sequences, texts are instead tokenized into common sequences of characters.
For LLMs such as GPT, tokenization is nowadays commonly done at the subword level, for example with byte-pair encoding.
However, especially for smaller models, it is a good idea to use whole words as tokens.
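As an illustration of word-level tokenization, the following is a simplified pure-Python sketch: words are ranked by frequency in the fitting texts, and each text is then converted to a list of ranks. It only approximates what the Keras Tokenizer does; the function names, the regular expression and the tie-breaking rule are assumptions made for this example.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[a-z0-9']+")  # assumed, simplistic definition of a "word"

def fit_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    # Sort by descending frequency; ties are broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its rank; unknown or too-rare words are dropped."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in WORD_PATTERN.findall(text.lower()) if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]  # keep only the most common words
        sequences.append(ids)
    return sequences

word_index = fit_word_index(["a great great movie", "a bad movie"])
print(word_index)  # {'a': 1, 'great': 2, 'movie': 3, 'bad': 4}
print(texts_to_sequences(["a great unknown movie"], word_index))  # [[1, 2, 3]]
```

Index 0 is deliberately left unused here, so that it stays available as a padding value.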

### Exercise 5.4
Use the Keras Tokenizer (already imported for your convenience) to tokenize our input data.
#### Exercise 5.4.1
Instantiate the Tokenizer and fit it on the review texts of our **training data** using the `fit_on_texts()` method.
In this manner, the Tokenizer will determine how to tokenize data based on the provided fitting data.
You can directly pass in the Polars Series object, as it implements the required iterator for said method.

In [15]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])

#### Exercise 5.4.2
Subsequently, convert the reviews from our training and test data to create the final input data (X) to use for training and testing.
Display the resulting arrays to ensure that the text was tokenized and to get a rough idea of the input for our model.

In [16]:
train_x = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=250)
test_x = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=250)
display(train_x)
display(test_x)

array([[  43,   10,   13, ...,  205,  351, 3856],
       [  21,  103,    1, ...,   89,  103,    9],
       [   0,    0,    0, ...,    2,  710,   62],
       ...,
       [   0,    0,    0, ..., 1641,    2,  603],
       [   0,    0,    0, ...,  245,  103,  125],
       [   0,    0,    0, ...,   70,   73, 2062]], dtype=int32)

array([[   0,    0,    0, ...,  995,  719,  155],
       [  12,   59,  196, ...,  380,    7,    7],
       [   0,    0,    0, ...,   50, 1088,   96],
       ...,
       [   0,    0,    0, ...,  125,  200, 3241],
       [   0,    0,    0, ..., 1066,    1, 2305],
       [   0,    0,    0, ...,    1,  332,   27]], dtype=int32)
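The leading zeros in the arrays above are padding: every review is brought to exactly 250 tokens. The sketch below mimics this in pure Python, assuming the Keras defaults of `padding="pre"` and `truncating="pre"`; the helper name `pad_pre` is hypothetical and the real `pad_sequences` returns a numpy array rather than nested lists.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front when too long
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front when too short
    return padded

print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], maxlen=5))
# [[0, 0, 1, 2, 3], [3, 4, 5, 6, 7]]
```

Rows without leading zeros in the output therefore belong to reviews of at least 250 tokens, of which only the final 250 are kept.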

#### Exercise 5.4.3
Convert our training and test labels to numpy arrays.
Again inspect both arrays to ensure the content is roughly as expected.

In [17]:
train_y = train_data.get_column("sentiment").to_numpy()
test_y = test_data.get_column("sentiment").to_numpy()

In [18]:
train_y

array([0, 0, 1, ..., 0, 1, 1], dtype=int8)

## Defining and compiling the model
Here we build and compile the model.
The model we will be using is small, and it looks simple because Keras defines a great deal for us.
Although the model itself is outside the scope of this hackathon, I'll give a brief explanation of the different parts.
The Embedding layer takes the sparse input data (integer token IDs) and embeds it into a lower-dimensional space that is easier for our model to work with.
The output of this layer is passed into the LSTM (Long Short-Term Memory) layer, which internally consists of several gated components and is designed to learn how items in sequential data relate to each other (e.g. words in a sentence).
Finally, the output of the LSTM is passed into a Dense layer with a Sigmoid activation function.
This layer will give a continuous output between 0 and 1.
During the training phase, we will use the loss function ("binary cross-entropy" as defined below) to steer the model towards predicting a score close to 0 for the labels we defined as 0, and 1 for the labels we defined as 1.
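To give a feeling for the last two steps, here is a small pure-Python sketch of the sigmoid function and the binary cross-entropy loss. It is a sketch of the formulas only, not of Keras' implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of 0.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(sigmoid(0.0))  # 0.5
# Confident, correct predictions give a low loss ...
print(f"{binary_cross_entropy([1, 0], [0.9, 0.1]):.4f}")  # 0.1054
# ... while hesitant predictions for the same labels are penalised more
print(f"{binary_cross_entropy([1, 0], [0.6, 0.4]):.4f}")  # 0.5108
```

Minimising this loss is what pushes the model's output towards 1 for positive reviews and towards 0 for negative ones.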


In [19]:
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=250))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))

model.summary()

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])



## Train the model

In [20]:
model.fit(train_x, train_y, epochs=5, batch_size=64, validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 97s 191ms/step - accuracy: 0.7033 - loss: 0.5524 - val_accuracy: 0.8259 - val_loss: 0.3953
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 98s 196ms/step - accuracy: 0.8267 - loss: 0.3956 - val_accuracy: 0.8579 - val_loss: 0.3499
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 104s 208ms/step - accuracy: 0.7981 - loss: 0.4447 - val_accuracy: 0.8541 - val_loss: 0.3532
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 101s 203ms/step - accuracy: 0.8563 - loss: 0.3431 - val_accuracy: 0.8652 - val_loss: 0.3246
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 102s 204ms/step - accuracy: 0.8900 - loss: 0.2661 - val_accuracy: 0.8831 - val_loss: 0.2804


<keras.src.callbacks.history.History at 0x7f890231c140>

## Save the model and tokenizer
We save the model and tokenizer to be able to load them for a different purpose or in a later phase.
For the remainder of this notebook, however, we will keep using the instances we have here.

In [21]:
model.save("model.h5")
joblib.dump(tokenizer, "tokenizer.pkl")



['tokenizer.pkl']

## Evaluate the model

In [22]:
model.evaluate(test_x, test_y)

313/313 ━━━━━━━━━━━━━━━━━━━━ 16s 52ms/step - accuracy: 0.8868 - loss: 0.2740


[0.27364134788513184, 0.887499988079071]

## Use the model to predict the sentiment for a review
### Exercise 5.5
Define a function that takes a review text as input and uses the tokenizer to convert the text to a token sequence, pad it similarly to the input data and finally provide it as input for our model.
Then read the output and convert it to a "positive" or "negative" string.

In [23]:
def predict_sentiment_for_review(review):
    sequences = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequences, maxlen=250)
    prediction = model.predict(padded_sequence)
    return "positive" if prediction[0][0] >= 0.5 else "negative"

In [24]:
predict_sentiment_for_review("I'd rather watch paint dry")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 198ms/step


'negative'

In [25]:
predict_sentiment_for_review("The best movie I've ever seen")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step


'positive'

In [26]:
predict_sentiment_for_review("Absolutely stunning!")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step


'positive'

In [27]:
predict_sentiment_for_review("Even The Titanic was better...")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step


'negative'

In [28]:
predict_sentiment_for_review("sucks")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step


'negative'