(encoding-text)=
# Encoding Text    

Consider the following examples of movie reviews:


Example 1 
* Review 1A (negative) <br/>
I thought the movie would be great, *but it is not*. The script is weak, the pacing is slow, and the ending feels pointless.
* Review 1B (positive) <br/>
I thought the movie would not be great, but it *is*. The script is strong, the pacing is brisk, and the ending feels meaningful.


Example 2
* Review 2A (negative)<br/>
The acting is decent, but the plot is predictable and the jokes fall flat.
* Review 2B (positive)<br/>
The plot is predictable and the jokes fall flat, but the acting is decent.


Example 3
* Review 3A (negative)<br/>
For the first ninety minutes I waited for something exciting to happen. Spoiler: it never does.
* Review 3B (positive)<br/>
For the first ninety minutes I waited for something exciting to happen, and when it finally does it is worth every second.


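The problem with a **bag of words** representation is that it ignores word order. Both reviews in Example 2 contain exactly the same words, so they map to the same multiset of word counts:
$$
\text{vector} = (\text{acting}: 1,\ \text{plot}: 1,\ \text{predictable}: 1,\ \text{but}: 1,\ \ldots).
$$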

Because position is discarded, the classifier cannot learn that "not" negates what follows or that the clause after "but" usually carries the main sentiment.
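
To make this concrete, here is a minimal sketch that counts the words of Reviews 2A and 2B and compares the results. The helper name `bag_of_words` and the simple regex tokenizer are assumptions made for illustration; a real pipeline would typically use a library vectorizer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

review_2a = "The acting is decent, but the plot is predictable and the jokes fall flat."
review_2b = "The plot is predictable and the jokes fall flat, but the acting is decent."

bag_a = bag_of_words(review_2a)   # negative review
bag_b = bag_of_words(review_2b)   # positive review

print(bag_a == bag_b)             # True: the two bags are indistinguishable
print(bag_a["but"], bag_a["the"]) # 1 3
```

Since the two inputs are identical after counting, no classifier trained on these counts can assign them different labels.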
 



In [23]:
import numpy as np
from tensorflow.keras.layers import Embedding

# 1. Toy vocabulary: 5 tokens
#    0 = <PAD>, 1 = "good", 2 = "bad", 3 = "not", 4 = "movie"
vocab_size   = 5
embed_dim    = 3   # just three numbers per word
sequence_len = 4

embed = Embedding(input_dim=vocab_size,
                  output_dim=embed_dim,
                  input_length=sequence_len)

# 2. Example review, integer‑encoded and padded:
#    "not good movie"   →   [3, 1, 4, 0]
sample = np.array([[3, 1, 4, 0]])        # shape (1, 4)

dense_seq = embed(sample)                # shape (1, 4, 3)
print(dense_seq.numpy().round(2))


[[[ 0.01  0.01  0.02]
  [ 0.04  0.03 -0.03]
  [-0.05  0.03 -0.03]
  [ 0.04  0.03 -0.01]]]
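
The integer encoding and padding mentioned in the comments of the cell above can be sketched in plain Python. The `encode` helper below is hypothetical (it is not part of Keras), and it assumes whitespace tokenization and padding on the right, as in the toy example.

```python
# Toy vocabulary, copied from the comments in the cell above
vocab = {"<PAD>": 0, "good": 1, "bad": 2, "not": 3, "movie": 4}

def encode(text, vocab, length, pad_id=0):
    """Map each word to its integer id, then truncate or right-pad to `length`."""
    ids = [vocab[word] for word in text.lower().split()]
    ids = ids[:length]                            # cut sequences that are too long
    return ids + [pad_id] * (length - len(ids))   # pad sequences that are too short

print(encode("not good movie", vocab, length=4))  # [3, 1, 4, 0]
print(encode("not bad", vocab, length=4))         # [3, 2, 0, 0]
```

Note that Keras's `pad_sequences` pads at the front by default (`padding="pre"`), so it would turn the same review into `[0, 3, 1, 4]` unless `padding="post"` is passed.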


## Word Embeddings

Unlike bag-of-words counts, word embeddings map each token to a dense, trainable vector. During training these vectors gradually move so that “good” and “bad” point in different directions, while a sequential layer (CNN, RNN) that reads the tokens in order can learn that “not” flips the meaning of what follows.



In [27]:
import numpy as np
from tensorflow.keras.layers import Embedding

# 1. Toy vocabulary: 6 tokens
#    0 = <PAD>, 1 = "good", 2 = "bad", 3 = "not", 4 = "movie",
#    5 = one more word (left unnamed here)
vocab_size   = 6
embed_dim    = 2   # just two numbers per word
sequence_len = 4

embed = Embedding(input_dim=vocab_size,
                  output_dim=embed_dim,
                  input_length=sequence_len)

# 2. Example sequence, integer-encoded:
#    token 3 ("not") appears at positions 0 and 3,
#    so rows 0 and 3 of the output are identical
sample = np.array([[3, 1, 5, 3]])        # shape (1, 4)

dense_seq = embed(sample)                # shape (1, 4, 2)
print(dense_seq.numpy().round(2))


[[[ 0.01 -0.01]
  [-0.    0.03]
  [ 0.03  0.04]
  [ 0.01 -0.01]]]
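
Conceptually, an embedding layer is a lookup table: each token id selects one row of a weight matrix. The NumPy sketch below imitates that lookup with a randomly initialized table. It stands in for the layer's trainable weights and is not the Keras implementation; the name `table` and the initialization range are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 2

# Stand-in for the trainable weight matrix: one row per token id
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

sample = np.array([[3, 1, 5, 3]])    # shape (1, 4)
dense_seq = table[sample]            # shape (1, 4, 2): one row lookup per token id

print(dense_seq.shape)
# The same id always yields the same vector, wherever it appears
print(np.array_equal(dense_seq[0, 0], dense_seq[0, 3]))  # True
```

Because the lookup depends only on the token id, the embedding itself carries no information about position. Word order has to be picked up by whatever layer reads the sequence of vectors afterwards.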



## Using a CNN

In [25]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ------------------------------------------------------------------
# 1. Data loading and preprocessing
# ------------------------------------------------------------------
max_features = 20_000      # keep the 20 000 most frequent words
maxlen        = 400        # cut / pad every review to 400 tokens

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(
    num_words=max_features
)

x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test  = keras.preprocessing.sequence.pad_sequences(x_test,  maxlen=maxlen)

# ------------------------------------------------------------------
# 2. CNN model
# ------------------------------------------------------------------
model = keras.Sequential([
    layers.Embedding(max_features, 128, input_length=maxlen),

    layers.Conv1D(64, 7, activation="relu"),
    layers.MaxPooling1D(3),

    layers.Conv1D(64, 7, activation="relu"),
    layers.GlobalMaxPooling1D(),

    layers.Dense(1, activation="sigmoid")
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

model.summary()

# ------------------------------------------------------------------
# 3. Training
# ------------------------------------------------------------------
history = model.fit(
    x_train,
    y_train,
    epochs=8,
    batch_size=128,
    validation_split=0.2,
    verbose=0
)

# ------------------------------------------------------------------
# 4. Evaluation
# ------------------------------------------------------------------
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.3f}")

# ------------------------------------------------------------------
# 5. Predictions (first ten test examples, rounded for readability)
# ------------------------------------------------------------------
preds = model.predict(x_test[:10]).round(3).squeeze()
print("Predicted probabilities:", preds)


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_20 (Embedding)    (None, 400, 128)          2560000   
                                                                 
 conv1d_4 (Conv1D)           (None, 394, 64)           57408     
                                                                 
 max_pooling1d_2 (MaxPoolin  (None, 131, 64)           0         
 g1D)                                                            
                                                                 
 conv1d_5 (Conv1D)           (None, 125, 64)           28736     
                                                                 
 global_max_pooling1d_2 (Gl  (None, 64)                0         
 obalMaxPooling1D)                                               
                                                                 
 dense_2 (Dense)             (None, 1)                

A CNN is great at spotting local phrases like “not good,” but its pooling layers discard where in the review each phrase occurred. Sentiment often hinges on relations between words that are far apart in the text, so we need a model that can keep track of information across the whole sequence.

* A Conv1D with kernel size $k=7$ can only look at $7$ consecutive tokens at a time.
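* Stacking layers enlarges the receptive field only linearly: with $k=7$ and no pooling, one layer sees 7 tokens, two layers see 13, three see 19, and so on. Even with the `MaxPooling1D(3)` layer in the model above, each output of the second convolution covers only 27 tokens.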
* Important cues in a review are often dozens of tokens apart ("I hoped it would be great...but it is not.").
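
The receptive-field arithmetic can be checked with a short helper. This is a sketch using the standard recurrence for stacked 1D layers without dilation; the function name `receptive_field` is hypothetical, and each layer is described only by its kernel (or pool) size and stride.

```python
def receptive_field(layers):
    """Return how many input tokens one output position can see.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump   # each extra tap reaches `jump` tokens further
        jump *= stride                      # strides space later taps further apart
    return field

conv = (7, 1)   # Conv1D(64, 7): kernel 7, stride 1
pool = (3, 3)   # MaxPooling1D(3): window 3, stride 3

print(receptive_field([conv]))               # 7
print(receptive_field([conv, conv]))         # 13
print(receptive_field([conv, conv, conv]))   # 19
print(receptive_field([conv, pool, conv]))   # 27
```

Even the 27-token window of the conv, pool, conv stack is small next to reviews that are padded or cut to 400 tokens.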

On the IMDB dataset, a CNN like this typically reaches $\approx 0.88$ accuracy, about the same as the bag-of-words MLP, and then plateaus. Extra filters or layers add parameters but do not solve the fundamental distance problem.

