<a href="https://colab.research.google.com/github/YCCS-Summer-2023-DDNMA/project/blob/83-implementing-first-nn-as-practice-JC/Joseph_Couzens/notebooks/PosNegSentimentNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To build a sentiment analysis neural network for IMDb movie reviews, you can follow these general steps:

Dataset Preparation:
Download the IMDb movie reviews dataset from Kaggle or any other reliable source.
Preprocess the dataset by cleaning and transforming the text data. This may involve removing special characters, lowercasing the text, and tokenizing the reviews into individual words or subwords.
Split the dataset into training and testing sets. Typically, you reserve a portion of the dataset (e.g., 80%) for training the model and the remaining portion for evaluating its performance.

Text Representation:
Convert the processed text into numerical representations that can be fed into the neural network.
One common approach is to use word embeddings such as Word2Vec or GloVe, which map words to dense vectors. Alternatively, you can utilize techniques like TF-IDF or bag-of-words representation.
Ensure that all the input data has the same shape and length by padding or truncating the sequences as needed.

Model Architecture:
Choose the type of neural network architecture suitable for sentiment analysis, such as a recurrent neural network (RNN), long short-term memory (LSTM), or a 1D convolutional neural network (CNN).
Stack layers and define the structure of your neural network. This can include embedding layers (if not using pre-trained word embeddings), recurrent or convolutional layers, and dense layers.
Experiment with the number of layers, hidden units, and activation functions to find the right balance between model complexity and performance.

Training and Evaluation:
Compile the model by specifying the loss function (e.g., binary cross-entropy) and the optimizer (e.g., Adam or SGD) to use during training.
Train the model on the training dataset using the fit() function or similar methods. Monitor the training process by observing metrics like accuracy and loss.
Validate the model on the testing dataset to evaluate its performance and generalization.
Fine-tune hyperparameters and model architecture as necessary based on the evaluation results.

Prediction:
Use the trained model to predict sentiment on new, unseen movie reviews or user-provided text.
Process the input text in the same way as during the training phase (cleaning, tokenizing, padding, etc.).
Pass the preprocessed text through the trained model and obtain the predicted sentiment.

In [2]:
# Install ml-collections & latest Flax version from Github.
!pip install -q ml-collections git+https://github.com/google/flax

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.9/77.9 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for ml-collections (setup.py) ... [?25l[?25hdone


In [32]:
from absl import logging
from flax import linen as nn
from flax.metrics import tensorboard
from flax.training import train_state
import jax
import jax.numpy as jnp
import ml_collections
import numpy as np
import optax
import tensorflow_datasets as tfds
import pandas as pd

In [46]:
#function just used for splitting the dataset
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk #to get stop words to avoid for selecting vocab and sentiment words
from nltk.corpus import stopwords
nltk.download('stopwords')


def create_vocabulary(text_data):
  # Convert the NumPy array to a Pandas Series so it can be split like str
  text_series = pd.Series(text_data)
  # Tokenize the text data
  tokenized_text = text_series.str.split()

  # Remove stop words
  stop_words = set(stopwords.words('english'))
  tokenized_text = tokenized_text.apply(lambda x: [word for word in x if word not in stop_words])

  # Perform frequency analysis
  word_frequencies = tokenized_text.explode().value_counts()

  # Manually select sentiment-related words
  sentiment_related_words = ['good', 'bad', 'excellent', 'horrible', 'amazing', 'terrible', 'lacking',
                             'excellent', 'impressive', 'poor', 'interesting', 'well',
                             'great', 'fun', 'worst', 'best', 'lacking', 'funny', 'boring', 'uninteresting', 'not',
                             'amazing','awesome', 'brilliant', 'captivating', 'charming', 'delightful','engaging',
                             'enjoyable', 'entertaining', 'excellent', 'exceptional', 'exciting', 'fantastic', 'fascinating',
                             'heartwarming', 'impressive', 'incredible', 'inspiring',
                             'joyful', 'lovely', 'marvelous', 'masterful', 'outstanding', 'phenomenal', 'pleasing', 'remarkable',
                             'riveting', 'spectacular', 'splendid', 'stunning', 'superb', 'terrific', 'thrilling', 'wonderful',
                             'awful', 'boring', 'disappointing', 'dreadful','dull', 'fails', 'forgettable', 'frustrating', 'lackluster', 'mediocre', 'messy', 'painful', 'poor',
                             'predictable', 'ridiculous', 'slow', 'stupid', 'terrible', 'trite', 'underwhelming', 'uninspiring', 'uninteresting', 'unoriginal', 'weak',]  # Add more as needed

  # Create the vocabulary
  vocabulary = set(sentiment_related_words)

  return vocabulary

def create_bow_vectors(text_data, vocabulary):
  bow_vectors = []
  for document in text_data:
      word_frequency = {word: 0 for word in vocabulary}
      for word in document:
          if word in vocabulary:
              word_frequency[word] += 1
      bow_vectors.append(list(word_frequency.values()))
  return bow_vectors

mx_seq_len = 0 #for later reference in create train state
def get_datasets():
  global mx_seq_len

  file_path = 'IMDB_Dataset.csv'  # Relative file path
  data = pd.read_csv(file_path, error_bad_lines=False)
  train_data, test_data = train_test_split(data, test_size=0.5, random_state=42)

  vocabulary = create_vocabulary(train_data['review'].values)

  train_bow_vectors = create_bow_vectors(train_data['review'].values, vocabulary)
  test_bow_vectors = create_bow_vectors(test_data['review'].values, vocabulary)

  # Determine the maximum sequence length
  max_sequence_length = max(len(vector) for vector in train_bow_vectors + test_bow_vectors)
  mx_seq_len = max_sequence_length

  # Pad the sequences
  train_bow_vectors = pad_sequences(train_bow_vectors, maxlen=max_sequence_length, padding='post', truncating='post')
  test_bow_vectors = pad_sequences(test_bow_vectors, maxlen=max_sequence_length, padding='post', truncating='post')

  train_labels = train_data['sentiment'].values
  test_labels = test_data['sentiment'].values

  #now pass these vectors into the network
  return train_bow_vectors, train_labels, test_bow_vectors, test_labels

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
#now attempting to apply the CNN from train.py
class sentimentCNN(nn.Module):
  """A BoW sentiment CNN model."""

  @nn.compact
  def __call__(self, x):
    x = nn.Conv(features=32, kernel_size=(3, 3))(x)
    x = nn.relu(x)
    x = nn.avg_pool(x, window_shape=(2, 2), strides=(2, 2))
    x = nn.Conv(features=64, kernel_size=(3, 3))(x)
    x = nn.relu(x)
    x = nn.avg_pool(x, window_shape=(2, 2), strides=(2, 2))
    x = x.reshape((x.shape[0], -1))  # flatten
    x = nn.Dense(features=256)(x)
    x = nn.relu(x)
    x = nn.Dense(features=10)(x)
    return x

@jax.jit
def apply_model(state, input, labels):
  """Computes gradients, loss and accuracy for a single batch."""
  def loss_fn(params):
    logits = state.apply_fn({'params': params}, input)
    one_hot = jax.nn.one_hot(labels, 10)
    loss = jnp.mean(optax.softmax_cross_entropy(logits=logits, labels=one_hot))
    return loss, logits

  grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
  (loss, logits), grads = grad_fn(state.params)
  accuracy = jnp.mean(jnp.argmax(logits, -1) == labels)
  return grads, loss, accuracy

@jax.jit
def update_model(state, grads):
  return state.apply_gradients(grads=grads)

def train_epoch(state, train_ds, train_labels, batch_size, rng):
  """Train for a single epoch."""
  train_ds_size = len(train_ds)
  steps_per_epoch = train_ds_size // batch_size

  perms = jax.random.permutation(rng, len(train_ds))
  perms = perms[:steps_per_epoch * batch_size]  # skip incomplete batch
  perms = perms.reshape((steps_per_epoch, batch_size))

  epoch_loss = []
  epoch_accuracy = []

  for perm in perms:
    batch_vectors = train_ds[perm, ...]
    batch_labels = train_labels[perm, ...]
    grads, loss, accuracy = apply_model(state, batch_vectors, batch_labels)
    state = update_model(state, grads)
    epoch_loss.append(loss)
    epoch_accuracy.append(accuracy)
  train_loss = np.mean(epoch_loss)
  train_accuracy = np.mean(epoch_accuracy)
  return state, train_loss, train_accuracy


def create_train_state(rng, config):
    """Creates initial `TrainState`."""
    cnn = sentimentCNN()
    params = cnn.init(rng, jnp.ones([1, mx_seq_len]))['params']  # Adjust the input shape based on your BoW vector shape
    tx = optax.sgd(config.learning_rate, config.momentum)
    return train_state.TrainState.create(
        apply_fn=cnn.apply, params=params, tx=tx)



def train_and_evaluate(config: ml_collections.ConfigDict, workdir: str) -> train_state.TrainState:
    """Execute model training and evaluation loop.

    Args:
        config: Hyperparameter configuration for training and evaluation.
        workdir: Directory where the tensorboard summaries are written to.

    Returns:
        The train state (which includes the `.params`).
    """
    train_ds, train_labels, test_ds, test_labels = get_datasets()
    rng = jax.random.PRNGKey(0)

    summary_writer = tensorboard.SummaryWriter(workdir)
    summary_writer.hparams(dict(config))

    rng, init_rng = jax.random.split(rng)
    state = create_train_state(init_rng, config)

    for epoch in range(1, config.num_epochs + 1):
        rng, input_rng = jax.random.split(rng)
        state, train_loss, train_accuracy = train_epoch(state, train_ds, train_labels, config.batch_size, input_rng)
        _, test_loss, test_accuracy = apply_model(state, test_ds, test_labels)

        logging.info(
            'epoch:% 3d, train_loss: %.4f, train_accuracy: %.2f, test_loss: %.4f, test_accuracy: %.2f'
            % (epoch, train_loss, train_accuracy * 100, test_loss, test_accuracy * 100))

        summary_writer.scalar('train_loss', train_loss, epoch)
        summary_writer.scalar('train_accuracy', train_accuracy, epoch)
        summary_writer.scalar('test_loss', test_loss, epoch)
        summary_writer.scalar('test_accuracy', test_accuracy, epoch)

    summary_writer.flush()
    return state






In [47]:
#Now lets attempt to actually use the model

config = ml_collections.ConfigDict()
config.num_epochs = 5
config.batch_size = 32
config.learning_rate = 0.001
import os
workdir = os.getcwd()  # Get the current working directory


train_and_evaluate(config, workdir)



  data = pd.read_csv(file_path, error_bad_lines=False)


ValueError: ignored

In [43]:
#Now lets attempt to actually use the model
example_directory = 'examples/mnist'
editor_relpaths = ('configs/default.py')

repo, branch = 'https://github.com/google/flax', 'main'

%load_ext autoreload
%autoreload 2
from configs import default as config_lib
config = config_lib.get_config()


if 'google.colab' in str(get_ipython()):
  %load_ext tensorboard
  %tensorboard --logdir=.


config.num_epochs = 3
models = {}
for momentum in (0.8, 0.9, 0.95):
  name = f'momentum={momentum}'
  config.momentum = momentum
  state = train_and_evaluate(config, workdir=f'./models/{name}')
  models[name] = state.params




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


ModuleNotFoundError: ignored