# Experiment with the bag-of-words model

__Objective:__ classify documents (movie reviews) as good or bad using a simple bag-of-words model.

Source: https://pyimagesearch.com/2022/07/04/introduction-to-the-bag-of-words-bow-model/

In [None]:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow as tf
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense

## Read data

In [None]:
reviews_data = pd.read_csv('./star_wars_ep_9_reviews_rotten_tomatoes.csv', sep=';', names=['text', 'stars'])

reviews_data

Quantize the score (stars) to get a binary classification problem: reviews with at least 2.5 stars are labeled 1 (good), all others 0 (bad).

In [None]:
reviews_data['quantized_score'] = reviews_data['stars'].apply(lambda x: 1. if x >= 2.5 else 0.)

In [None]:
reviews_data
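
The thresholding rule can be checked in isolation. The following is a minimal sketch in plain Python; the ratings in `example_stars` are made up for illustration and do not come from the CSV file, and `quantize` is a hypothetical helper that mirrors the lambda used above. Counting the resulting labels is also a quick way to spot a skewed class balance.

```python
from collections import Counter

THRESHOLD = 2.5  # same cut-off as the lambda in the notebook


def quantize(stars: float) -> float:
    """Map a star rating to 1.0 (good) or 0.0 (bad)."""
    return 1.0 if stars >= THRESHOLD else 0.0


# Hypothetical ratings, not taken from the review data.
example_stars = [0.5, 2.0, 2.5, 4.0, 5.0]
example_labels = [quantize(s) for s in example_stars]

# Count how many examples fall into each class.
label_counts = Counter(example_labels)

print(example_labels)  # [0.0, 0.0, 1.0, 1.0, 1.0]
print(label_counts)
```

A rating of exactly 2.5 lands in the "good" class because the comparison is `>=`.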

## Text tokenization

We fit the tokenizer on the reviews to build the vocabulary, then use it to obtain the word counts for each document. Each row of the resulting count matrix is the feature vector for one review.

In [None]:
tokenizer = Tokenizer(lower=True)

tokenizer.fit_on_texts(reviews_data['text'])

In [None]:
word_counts = tokenizer.texts_to_matrix(reviews_data['text'], mode='count')

word_counts
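
To make the structure of such a count matrix concrete, here is a minimal sketch in plain Python. It assumes a tiny made-up corpus and a simplified tokenization (lower-casing and whitespace splitting only); the indices are assigned in order of first appearance, which is not necessarily the order the Keras `Tokenizer` uses. Like the Keras tokenizer, the sketch leaves index 0 unused, so the matrix has one more column than there are distinct words.

```python
from collections import Counter

# Hypothetical mini-corpus, not taken from the review data.
docs = ["great movie great cast", "bad movie"]

# Simplified tokenization: lower-case and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Build the vocabulary; indices start at 1 so that column 0 stays unused.
vocab = {}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)


def count_vector(tokens):
    """Return the bag-of-words count vector for one tokenized document."""
    row = [0] * (len(vocab) + 1)
    for tok, n in Counter(tokens).items():
        row[vocab[tok]] = n
    return row


count_matrix = [count_vector(tokens) for tokens in tokenized]

print(vocab)         # {'great': 1, 'movie': 2, 'cast': 3, 'bad': 4}
print(count_matrix)  # [[0, 2, 1, 1, 0], [0, 0, 1, 0, 1]]
```

Word order is discarded: only how often each word occurs in a document is kept, which is what makes this a bag-of-words representation.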

## Build and train a model

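Build the model: a small feed-forward network with two hidden layers and a sigmoid output for binary classification.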

In [None]:
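# Define the inputs to the model: one feature per column of the count matrix.
# `shape` must be a tuple, not a bare integer.
inputs = Input(shape=(word_counts.shape[-1],))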

# Define the outputs of the model with Keras'
# functional API.
x = Dense(units=64, activation='relu')(inputs)
x = Dense(units=32, activation='relu')(x)
outputs = Dense(units=1, activation='sigmoid')(x)

# Define the Model object.
model = Model(
    inputs=inputs,
    outputs=outputs
)

# Compile the model.
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
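
The size of this network is dominated by the first layer, because its input width equals the number of columns in the count matrix. The sketch below works through the arithmetic for the three `Dense` layers (64, 32 and 1 units). The input width of 1000 is a hypothetical value chosen for illustration; the real value is `word_counts.shape[-1]`, and `model.summary()` reports the actual counts.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a dense layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units


# Hypothetical input width; in the notebook this is word_counts.shape[-1].
input_width = 1000
layer_units = [64, 32, 1]  # same layer sizes as the model above

params_per_layer = []
n_in = input_width
for units in layer_units:
    params_per_layer.append(dense_params(n_in, units))
    n_in = units  # the output of one layer feeds the next

total_params = sum(params_per_layer)

print(params_per_layer)  # [64064, 2080, 33]
print(total_params)      # 66177
```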

Train model.

In [None]:
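# Convert the features and labels to tensors.
x = tf.constant(word_counts)
y = tf.constant(reviews_data['quantized_score'].to_numpy())

model.fit(
    x=x,
    y=y,
    epochs=3,
    batch_size=None  # None falls back to the Keras default of 32
)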
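
During training, the model minimizes the binary cross-entropy between the labels and the sigmoid outputs. The following plain-Python sketch shows the formula on made-up values; the clipping constant `eps` is an assumption added here to avoid `log(0)`, and the function is an illustration, not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities.
labels = [1.0, 0.0, 1.0]
probs = [0.9, 0.2, 0.6]

loss = binary_crossentropy(labels, probs)
print(round(loss, 4))  # 0.2798

# A confident but wrong prediction is penalized much more heavily.
bad_loss = binary_crossentropy([1.0], [0.05])
print(bad_loss > loss)  # True
```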

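Test prediction. Note that this runs on the training data, so it is only a sanity check and says nothing about how well the model generalizes.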

In [None]:
model(x)
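
The call above returns one sigmoid probability per review. To turn these into class labels and compare them with the targets, one option is to threshold at 0.5, as in the minimal sketch below. The probabilities and labels are made up for illustration, the 0.5 threshold is an assumption, and `to_labels` and `accuracy` are hypothetical helpers rather than Keras functions.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0.0/1.0 class labels."""
    return [1.0 if p >= threshold else 0.0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical model outputs and true labels.
predicted_probs = [0.91, 0.12, 0.55, 0.40]
true_labels = [1.0, 0.0, 0.0, 0.0]

predicted_labels = to_labels(predicted_probs)
score = accuracy(true_labels, predicted_labels)

print(predicted_labels)  # [1.0, 0.0, 1.0, 0.0]
print(score)             # 0.75
```

Computed on the training data, such a score tends to be optimistic; evaluating on reviews held out from training gives a more honest estimate.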