# A Simple Feed Forward Neural Network from Scratch - Sentiment Analysis

© 2025 by [Damir Cavar](http://damir.cavar.me/)


**Prerequisites:**

We will need to install the following modules:

In [None]:
!pip install -U numpy
!pip install -U nltk
!pip install -U scipy
!pip install -U scikit-learn

We will need the following modules and some specific functions from them:

In [1]:
import os
import csv
from nltk.tokenize import word_tokenize
from collections import Counter
import math
import numpy as np
from sklearn.model_selection import train_test_split

## Sentiment Analysis Example

In the following we will use the reviews data that we used in the Logistic Regression section in notebook *Scikitlearn_logistic_regression_2*.

Load the Reviews corpus:

In [2]:
experiment_data = []
with open(os.path.join('.', 'data', 'reviews.csv'), newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = next(datareader)
    for row in datareader:
        if len(row) == 2:
            experiment_data.append( [row[0].strip(), int(row[1].strip())] )

In [3]:
print(experiment_data[:2])
print("Text:\n", experiment_data[0][0])
print("Value:\n", experiment_data[0][1])
print("Number of records:", len(experiment_data))

[["Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.", 0], ["This is an example of why the majority of action films are the same. Generic and boring,

Load the Vader lexicon for lexical sentiment analysis:

In [4]:
sentiment_dictionary = {}
with open(os.path.join('.', 'data', 'vader_lexicon.txt'), mode='r', encoding='utf-8') as ifile:
    lines = ifile.readlines()
    sentiment_dictionary = { y[0]: y[1] for y in [ x.split('\t') for x in lines ] if len(y) == 4 }

In [5]:
print(sentiment_dictionary["adventurer"])

1.2


Define the set of 1st and 2nd person pronouns:

In [6]:
pronouns = {"i", "me", "my", "mine", "you", "yours", "yourself", "myself", "we", "us", "our", "ours", "ourselves"}

The vectorization function for the review texts is:

In [7]:
def vectorizer(text: str) -> list:
    tokens = word_tokenize(text.lower())
    scores = [ float(sentiment_dictionary.get(t, 0.0)) for t in tokens ]
    positive = len([ s for s in scores if s > 0 ])
    negative = len([ s for s in scores if s < 0 ])
    if "no" in tokens:
        no_present = 1
    else:
        no_present = 0
    counts = Counter(tokens)
    pronoun_count = 0
    for x in set(counts.keys()).intersection(pronouns):
        pronoun_count += counts[x]
    if "!" in tokens:
        exclamation = 1
    else:
        exclamation = 0
    return [positive, negative, no_present, pronoun_count, exclamation, math.log(len(tokens))]
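
To make the six features concrete, here is a minimal sketch of the same feature logic on a single toy sentence. It assumes a made-up three-word lexicon (`toy_lexicon`) in place of the VADER file, a shortened pronoun set, and plain whitespace splitting in place of `word_tokenize`; the name `toy_vectorizer` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for the VADER file; the scores are made up.
toy_lexicon = {"great": 3.1, "boring": -1.3, "waste": -1.8}
toy_pronouns = {"i", "me", "my", "we", "you"}

def toy_vectorizer(text):
    # Whitespace splitting is a simplification of word_tokenize.
    tokens = text.lower().split()
    scores = [toy_lexicon.get(t, 0.0) for t in tokens]
    counts = Counter(tokens)
    return [
        sum(1 for s in scores if s > 0),        # number of positive words
        sum(1 for s in scores if s < 0),        # number of negative words
        int("no" in tokens),                    # 1 if "no" occurs
        sum(counts[p] for p in toy_pronouns),   # 1st and 2nd person pronoun count
        int("!" in tokens),                     # 1 if "!" occurs
        math.log(len(tokens)),                  # log of the token count
    ]

vec = toy_vectorizer("i think this was a boring waste , no great scenes !")
print(vec)
```

The sentence has 12 whitespace-separated tokens, one positive and two negative lexicon hits, one pronoun, a "no", and an exclamation mark, so the first five features are `[1, 2, 1, 1, 1]` and the last one is `log(12)`.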

Generate the vectors for all the corpus texts:

In [8]:
X = [ np.array(vectorizer(e[0])) for e in experiment_data ]

In [9]:
print(len(X), len(X[0]))
print(X[1], "\n", experiment_data[1], "\n", experiment_data[1][0], "\n", experiment_data[1][1])

50000 6
[13.          8.          0.          3.          0.          5.58349631] 
 ["This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying... Overall, this is second-rate action t

Generate the truth values for the data set:

In [10]:
y = np.array([ float(e[1]) for e in experiment_data ]).reshape(-1, 1)

In [11]:
print(y)
print(len(y))

[[0.]
 [0.]
 [0.]
 ...
 [1.]
 [1.]
 [1.]]
50000


In [13]:
X = np.array(X)
print(X)

[[ 9.          9.          1.          8.          0.          5.23110862]
 [13.          8.          0.          3.          0.          5.58349631]
 [ 6.         14.          1.          5.          0.          5.49306144]
 ...
 [22.         18.          0.         12.          0.          6.14632926]
 [11.          0.          0.         12.          0.          5.32300998]
 [11.          7.          1.          1.          1.          5.27299956]]


We will need the following functions for inferencing and backpropagation:

In [14]:
def sigmoid(z):
	"""Sigmoid activation function"""
	return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
	"""Derivative of sigmoid function"""
	s = sigmoid(z)
	return s * (1 - s)

def relu(z):
	"""ReLU activation function"""
	return np.maximum(0, z)

def relu_derivative(z):
	"""Derivative of ReLU function"""
	return (z > 0.0).astype(float)

def cross_entropy_loss(y_true, y_pred):
	"""
	Input:
		y_true: True labels   
		y_pred: Predicted probabilities
	Output:
		loss: Cross-entropy loss
	"""
	m = y_true.shape[0]
	epsilon = 1e-15  # for numerical stability - no log(0)
	y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
	loss = -(1/m) * np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
	return loss
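
A quick way to gain confidence in a hand-written derivative is to compare it with a finite-difference estimate. The following sketch repeats the `sigmoid` and `sigmoid_derivative` definitions so that it runs on its own; the grid of test points and the step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-5.0, 5.0, 11)   # a few test points, chosen arbitrarily
h = 1e-5                         # finite-difference step, chosen arbitrarily

# Central difference approximation of d sigmoid / dz
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_error = np.max(np.abs(numeric - sigmoid_derivative(z)))
print("Largest deviation:", max_error)
```

If the analytic derivative is correct, the largest deviation should be tiny compared to the derivative values themselves, which peak at 0.25 for `z = 0`.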

Initialize the variables and weights for the network:

In [16]:
print(X[0])
num_features = len(X[0])
num_rows = 3
W = np.random.rand(num_rows, num_features)
print("W:\n", W)
b = np.random.rand(num_rows)
print("b:\n", b)
#print("W+b:\n", relu(W @ X[0] + b) )

[9.         9.         1.         8.         0.         5.23110862]
W:
 [[0.99217103 0.07293832 0.46161288 0.9837251  0.13113083 0.51543797]
 [0.8168424  0.46521118 0.94027466 0.89042465 0.35039789 0.45721929]
 [0.53580728 0.81325612 0.55401231 0.55926888 0.98531595 0.79650713]]
b:
 [0.80022043 0.11151789 0.44289653]


The output layer receives a *num_rows*-dimensional vector and generates one output score *z*, which the sigmoid function maps to a probability. Initialize the output layer of the network:

In [17]:
U = np.random.rand(1, num_rows)
print("U:\n", U)
bu = np.random.rand(1)
print("bu:\n", bu)
test = W @ X[0] + b
print(test)
print(sigmoid(U @ test + bu))

U:
 [[0.51257843 0.57473851 0.04378136]]
bu:
 [0.39124699]
[21.41393034 22.10543579 21.77924589]
[1.]


The following code runs one complete inference pass over the entire data set using the randomly initialized weights; no training takes place yet. As an exercise, use some function to randomly select one portion of the data for training and another portion for testing, and use only the training portion for fitting the model. Then change the code to set up a real evaluation on the test portion, including the computation of the F1-score.

In [18]:
results = []
for input_vector, truth in zip(X, y):
    z = (W @ input_vector) + b
    a = relu(z)
    c = sigmoid((U @ a) + bu)
    results.append( (truth, c) )

In [19]:
print("# of results:", len(results))

# of results: 50000


In [20]:
counting_true_negatives = 0
counting_true_positives = 0
counting_false_positives = 0
counting_false_negatives = 0
for res in results:
    # print(res[0], res[1][0])
    if res[0] == 1:
        if res[1][0] >= 0.5:
            counting_true_positives += 1
        else:
            counting_false_negatives += 1
    else:
        if res[1][0] < 0.5:
            counting_true_negatives += 1
        else:
            counting_false_positives += 1
print("True positives:", counting_true_positives)
print("True negatives:", counting_true_negatives)
print("False positives:", counting_false_positives)
print("False negatives:", counting_false_negatives)

True positives: 25000
True negatives: 0
False positives: 25000
False negatives: 0
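
The counts show that the untrained network labels every review as positive. A likely explanation is saturation: `np.random.rand` draws weights from the interval [0, 1) and all six features are non-negative, so the weighted sums become large and positive and the sigmoid returns a value very close to 1. The sketch below recomputes the first example, with all numbers copied from the printed outputs of the cells above (they will differ in a new run).

```python
import numpy as np

# Values copied from the printed outputs above.
x0 = np.array([9.0, 9.0, 1.0, 8.0, 0.0, 5.23110862])
W = np.array([
    [0.99217103, 0.07293832, 0.46161288, 0.9837251, 0.13113083, 0.51543797],
    [0.8168424, 0.46521118, 0.94027466, 0.89042465, 0.35039789, 0.45721929],
    [0.53580728, 0.81325612, 0.55401231, 0.55926888, 0.98531595, 0.79650713],
])
b = np.array([0.80022043, 0.11151789, 0.44289653])
U = np.array([[0.51257843, 0.57473851, 0.04378136]])
bu = np.array([0.39124699])

hidden = np.maximum(0, W @ x0 + b)   # ReLU leaves these positive values unchanged
z_out = U @ hidden + bu              # pre-activation of the output unit
prob = 1 / (1 + np.exp(-z_out))      # sigmoid

print("hidden:", hidden)
print("z_out:", z_out)
print("1 - prob:", 1 - prob)
```

With a pre-activation above 20, the predicted probability is indistinguishable from 1 at print precision, which matches the `[1.]` shown earlier.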


## Training the Model

The following FFNN class uses:

**Xavier/Glorot initialization**

This samples the weights from a normal or uniform distribution whose standard deviation is scaled by the number of input and output neurons of the layer. Keeping the variance of activations and gradients roughly constant across layers helps to prevent vanishing or exploding gradients, which allows for more stable and efficient training. Note that the class below scales normal samples by `sqrt(2 / fan_in)`, which is the He variant of this idea that is commonly paired with ReLU; the Xavier/Glorot rule for normal samples uses `sqrt(2 / (fan_in + fan_out))`.
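
As a minimal sketch, the two common scaling rules for normally distributed weights can be compared side by side. The layer sizes match the 6 input features and the 8 hidden units used in this notebook; the seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility of the sketch
fan_in, fan_out = 6, 8           # input features and hidden units

# Xavier/Glorot (normal variant): std depends on both fan_in and fan_out
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
# He (normal variant): std depends on fan_in only, often used with ReLU
he_std = np.sqrt(2.0 / fan_in)

W_xavier = rng.standard_normal((fan_in, fan_out)) * xavier_std
W_he = rng.standard_normal((fan_in, fan_out)) * he_std

print(f"Xavier std: {xavier_std:.4f}, He std: {he_std:.4f}")
print("Shapes:", W_xavier.shape, W_he.shape)
```

For these layer sizes the He rule gives a noticeably larger standard deviation (about 0.58) than the Xavier/Glorot rule (about 0.38).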



In [21]:
class FFNN:
    """A simple feedforward neural network with one hidden layer."""

    def __init__(self, input_dim=6, hidden_dim=8, learning_rate=0.01):
        """
        Initialize a FFNN:
        """
        self.lr = learning_rate

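        # Initialize weights from a scaled normal distribution. The factor
        # sqrt(2 / fan_in) is the He variant; Xavier/Glorot would use
        # sqrt(2 / (fan_in + fan_out)).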
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)
        self.b1 = np.zeros((1, hidden_dim))
        
        self.W2 = np.random.randn(hidden_dim, 1) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros((1, 1))

    def forward_pass(self, X):
        """
        Forward pass / inference
        Input: X = input data of shape (batch_size, input_dim), where input_dim is the feature vector length
        Return: predictions of shape (batch_size, 1), one scalar per input sample
        """
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2
    
    def backward_pass(self, X, y, output):
        """
        Backward pass
        Inputs:
            X = input vectors, the shape is the batch_size times the input_dim (feature vector length)
            y = True labels of shape (batch_size, 1)
            output = Predicted values of shape (batch_size, 1)
        """

        m = X.shape[0]  # get the batch size
        # Compute gradients for the output layer:
        # - using binary cross-entropy loss with sigmoid output
        dz2 = output - y  # derivative of loss w.r.t z2
        dW2 = (1/m) * np.dot(self.a1.T, dz2)
        db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)

        # Compute gradients for hidden layer
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * relu_derivative(self.z1)
        dW1 = (1/m) * np.dot(X.T, dz1)
        db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)

        # Store gradients
        self.dW1 = dW1
        self.db1 = db1
        self.dW2 = dW2
        self.db2 = db2

    def update_parameters(self):
        """Update parameters using gradient descent."""
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2

    def train(self, X, y, epochs=1000, batch_size=32, verbose=True):
        """
        Training loop
        Inputs:
            X: Training data, the shape is n_samples and input_dim (= feature vector length)
            y: Training labels, the shape is n_samples and 1 (= one scalar per input sample)
            epochs: Number of training epochs, default 1,000
            batch_size: Size of mini-batches, default 32
            verbose: Whether to print loss, default True
        Outputs:
            losses: List of loss values per epoch
        """
        n_samples = X.shape[0]
        losses = []
        
        for epoch in range(epochs):
            # Shuffle data
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            epoch_loss = 0
            n_batches = 0

            # Mini-batch gradient descent
            for i in range(0, n_samples, batch_size):
                X_batch = X_shuffled[i:i+batch_size]
                y_batch = y_shuffled[i:i+batch_size]

                # Forward pass
                output = self.forward_pass(X_batch)

                # Compute loss
                batch_loss = cross_entropy_loss(y_batch, output)
                epoch_loss += batch_loss
                n_batches += 1

                # Backward pass with parameter updates after each batch
                self.backward_pass(X_batch, y_batch, output)
                self.update_parameters()

            # Average loss for epoch
            avg_loss = epoch_loss / n_batches
            losses.append(avg_loss)

            if verbose and (epoch % 100 == 0 or epoch == epochs - 1):
                print(f"Epoch {epoch}/{epochs}, Loss: {avg_loss:.6f}")

        return losses

    def predict(self, X):
        """Make predictions on new data"""
        return self.forward_pass(X)
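
The output-layer gradient `dz2 = output - y` in `backward_pass` is stated without derivation. As a short sketch of where it comes from, write $a$ for the sigmoid output `a2`, $z$ for `z2`, and consider the binary cross-entropy of a single sample:

$$
\begin{aligned}
L &= -\bigl[\, y \log a + (1 - y)\log(1 - a) \,\bigr], \qquad a = \sigma(z) \\
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)} \\
\frac{\partial a}{\partial z} &= a(1 - a) \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial z} = a - y
\end{aligned}
$$

The remaining lines of `backward_pass` apply the chain rule layer by layer and average over the $m$ samples of the batch:

$$
dW_2 = \frac{1}{m}\, a_1^{\top} dz_2, \qquad
dz_1 = \bigl(dz_2\, W_2^{\top}\bigr) \odot \mathbb{1}[z_1 > 0], \qquad
dW_1 = \frac{1}{m}\, X^{\top} dz_1
$$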

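This is a simple, non-random way to split the data set into a training and a test set. Because the labels in this corpus appear to be ordered (the output of `y` above starts with negative and ends with positive reviews), a sequential cut like this one would likely leave the test set dominated by a single class: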

In [None]:
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

The better method is to select the training and test sets randomly, for example with the *scikit-learn* function *train_test_split*. With a fixed `random_state`, as below, the split is reproducible; with a different value, or without `random_state`, each call returns a new random split. We can repeat the experiment 10 times, each time with newly selected training and test sets, and average the error and the F1-score over the runs.

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
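
The idea of repeating the experiment over several random splits can be sketched as follows. To stay self-contained, this sketch uses a small numpy-only helper (`random_split`, a hypothetical name) instead of `train_test_split`, and made-up toy data whose labels are sorted, as the corpus labels appear to be. It only records the share of positive labels in each test set; a real run would train a fresh model per split and record its score.

```python
import numpy as np

def random_split(X, y, test_size=0.2, seed=None):
    """Shuffle row indices and cut them into a training and a test part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (made up): 100 samples, 6 features, negatives first, positives last.
X_toy = np.arange(600, dtype=float).reshape(100, 6)
y_toy = np.array([0.0] * 50 + [1.0] * 50).reshape(-1, 1)

test_positive_rates = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = random_split(X_toy, y_toy, test_size=0.2, seed=seed)
    # A real run would train a fresh model here and record its F1-score.
    test_positive_rates.append(float(y_te.mean()))

mean_positive_rate = float(np.mean(test_positive_rates))
print("Share of positives per test set:", test_positive_rates)
print("Mean over 10 splits:", mean_positive_rate)
```

Unlike a sequential cut of sorted data, each random test set contains a mix of both classes.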

We can create a model using our FFNN class above:

In [27]:
model = FFNN(input_dim=(len(X[0])), hidden_dim=8, learning_rate=0.01)

We train the network parameters on the training data using 1,000 epochs and a batch size of 32. This means that the network will see each training example 1,000 times, since every sample is processed exactly once per epoch. The processing is iterative by batches; a batch can be the entire data set or a smaller subset of it. The training method returns the average loss of each epoch in a list.

In [28]:
losses = model.train(X_train, y_train, epochs=1000, batch_size=32)

Epoch 0/1000, Loss: 0.652037
Epoch 100/1000, Loss: 0.549889
Epoch 200/1000, Loss: 0.549259
Epoch 300/1000, Loss: 0.548848
Epoch 400/1000, Loss: 0.548508
Epoch 500/1000, Loss: 0.548304
Epoch 600/1000, Loss: 0.548336
Epoch 700/1000, Loss: 0.548067
Epoch 800/1000, Loss: 0.547589
Epoch 900/1000, Loss: 0.547541
Epoch 999/1000, Loss: 0.547559
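
The relation between epochs, batches, and parameter updates can be made concrete with a little arithmetic. The sketch assumes the 80/20 split of the 50,000 reviews used above, so that the training set holds 40,000 samples.

```python
import math

n_samples = 50000        # records in the corpus
train_fraction = 0.8     # 80/20 split, as with test_size=0.2
batch_size = 32
epochs = 1000

n_train = round(n_samples * train_fraction)
batches_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller
total_updates = batches_per_epoch * epochs            # one update per mini-batch

print("Training samples:", n_train)
print("Batches per epoch:", batches_per_epoch)
print("Parameter updates in total:", total_updates)
```

Under these assumptions each epoch consists of 1,250 mini-batches, so the weights are updated 1,250,000 times over the whole training run.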


The number of epochs is a hyperparameter that can impact the model. With too few epochs the result could be an underfitted model that has not learned enough of the data properties. With too many epochs the result could be an overfitted model that performs poorly on new data.
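
One simple way to judge whether more epochs still help is to look at how much the loss improves between logged checkpoints. The sketch below uses the loss values printed by the training run above; the tolerance of `1e-3` is an arbitrary assumption, not a recommended value.

```python
# (epoch, training loss) pairs copied from the training log above
logged = [
    (0, 0.652037), (100, 0.549889), (200, 0.549259), (300, 0.548848),
    (400, 0.548508), (500, 0.548304), (600, 0.548336), (700, 0.548067),
    (800, 0.547589), (900, 0.547541), (999, 0.547559),
]

tolerance = 1e-3       # arbitrary threshold for a "meaningful" improvement
plateau_epoch = None   # first logged epoch after which the gain drops below it

for (epoch_prev, loss_prev), (epoch_cur, loss_cur) in zip(logged, logged[1:]):
    if loss_prev - loss_cur < tolerance:
        plateau_epoch = epoch_prev
        break

gain_after_plateau = logged[1][1] - logged[-1][1]
print("Loss flattens after logged epoch:", plateau_epoch)
print(f"Total improvement from epoch 100 to 999: {gain_after_plateau:.6f}")
```

For this run, almost all of the improvement happens within the first 100 epochs. This only describes the training loss; detecting overfitting would additionally require tracking the loss on held-out data.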

We evaluate the model using our selected test data set:

In [29]:
y_pred = model.predict(X_test)

The test loss, computed as the cross-entropy loss:

In [30]:
test_loss = cross_entropy_loss(y_test, y_pred)

The accuracy:

In [31]:
y_pred_binary = (y_pred > 0.5).astype(float)
accuracy = np.mean(y_pred_binary == y_test)

In [32]:
print(f"Test Loss: {test_loss:.6f}")
print(f"Test Accuracy: {accuracy:.4f}")

Test Loss: 0.552522
Test Accuracy: 0.7195
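
Accuracy does not show how the errors are distributed over the two classes, which is why the exercise earlier in this notebook asks for the F1-score. A minimal sketch of precision, recall, and F1 for thresholded probabilities follows. The function name `precision_recall_f1` and the toy values are made up for illustration; in the notebook one would pass `y_test` and `y_pred` instead.

```python
import numpy as np

def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for the positive class."""
    y_hat = (np.asarray(y_prob) > threshold).astype(int).ravel()
    y_ref = np.asarray(y_true).astype(int).ravel()
    tp = int(np.sum((y_hat == 1) & (y_ref == 1)))
    fp = int(np.sum((y_hat == 1) & (y_ref == 0)))
    fn = int(np.sum((y_hat == 0) & (y_ref == 1)))
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels and probabilities (made up), shaped like y_test and y_pred
y_true_toy = np.array([[1.0], [1.0], [0.0], [0.0], [1.0]])
y_prob_toy = np.array([[0.9], [0.4], [0.7], [0.2], [0.8]])

precision, recall, f1 = precision_recall_f1(y_true_toy, y_prob_toy)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

In the toy data there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.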

