# **CSCE 5218 CSCE 4930 Deep Learning**

# **The Perceptron**


In [1]:
# Get the datasets
!!/usr/bin/curl --output test.dat https://raw.githubusercontent.com/huangyanann/CSCE5218/main/test_small.txt
!!/usr/bin/curl --output train.dat https://raw.githubusercontent.com/huangyanann/CSCE5218/main/train.txt


['  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current',
 '                                 Dload  Upload   Total   Spent    Left  Speed',
 '',
 '  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0',
 '  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0',
 '100 11645  100 11645    0     0  35692      0 --:--:-- --:--:-- --:--:-- 35611']

In [2]:
# Take a peek at the datasets
!head train.dat
!head test.dat

A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	
1	1	0	0	0	0	0	0	1	1	0	0	1	0
0	0	1	1	0	1	1	0	0	0	0	0	1	0
0	1	0	1	1	0	1	0	1	1	1	0	1	1
0	0	1	0	0	1	0	1	0	1	1	1	1	0
0	1	0	0	0	0	0	1	1	1	1	1	1	0
0	1	1	1	0	0	0	1	0	1	1	0	1	1
0	1	1	0	0	0	1	0	0	0	0	0	1	0
0	0	0	1	1	0	1	1	1	0	0	0	1	0
0	0	0	0	0	0	1	0	1	0	1	0	1	0
X1	X2	X3
1	1	1	1
0	0	1	1
0	1	1	0
0	1	1	0
0	1	1	0
0	1	1	0
0	1	1	0
0	1	1	0
1	1	1	1


### Build the Perceptron Model

You will need to complete some of the function definitions below.  DO NOT import any other libraries to complete this.

In [3]:
import math
import itertools
import re


# Corpus reader, all columns but the last one are coordinates;
#   the last column is the label
def read_data(file_name):
    f = open(file_name, 'r')

    data = []
    # Discard header line
    f.readline()
    for instance in f.readlines():
        if not re.search('\t', instance): continue
        instance = list(map(int, instance.strip().split('\t')))
        # Add a dummy input so that w0 becomes the bias
        instance = [-1] + instance
        data += [instance]
    return data


def dot_product(array1, array2):
    # Return dot product of array1 and array2
    # Multiply each pair of elements at the same index, then sum them all
    return sum(a * b for a, b in zip(array1, array2))


def sigmoid(x):
    # Return output of sigmoid function on x
    # Sigmoid maps any real value into the range (0, 1): 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))


# The output of the model, which for the perceptron is
# the sigmoid function applied to the dot product of
# the instance and the weights
def output(weight, instance):
    # Compute the weighted sum of inputs, then squash through sigmoid
    return sigmoid(dot_product(weight, instance))


# Predict the label of an instance; this is the definition of the perceptron
# you should output 1 if the output is >= 0.5 else output 0
def predict(weights, instance):
    # Threshold the continuous output at 0.5 to get a binary prediction
    return 1 if output(weights, instance) >= 0.5 else 0


# Accuracy = percent of correct predictions
def get_accuracy(weights, instances):
    # You do not to write code like this, but get used to it
    correct = sum([1 if predict(weights, instance) == instance[-1] else 0
                   for instance in instances])
    return correct * 100 / len(instances)


# Train a perceptron with instances and hyperparameters:
#       lr (learning rate)
#       epochs
# The implementation comes from the definition of the perceptron
#
# Training consists on fitting the parameters which are the weights
# that's the only thing training is responsible to fit
# (recall that w0 is the bias, and w1..wn are the weights for each coordinate)
#
# Hyperparameters (lr and epochs) are given to the training algorithm
# We are updating weights in the opposite direction of the gradient of the error,
# so with a "decent" lr we are guaranteed to reduce the error after each iteration.
def train_perceptron(instances, lr, epochs):

    # Step 1 — Weight Initialization
    # Initialize all weights (including bias weight w0) to zero.
    # The -1 accounts for the label in the last position of each instance.
    weights = [0] * (len(instances[0])-1)

    for _ in range(epochs):
        for instance in instances:
            # Step 2 — Forward Pass
            # Compute the weighted sum of inputs (net input)
            in_value = dot_product(weights, instance)
            # Apply the sigmoid activation to get a probability
            output = sigmoid(in_value)
            # Compute the error as the difference between the true label and the predicted output
            error = instance[-1] - output

            # Step 3 — Backward Pass / Weight Update (Gradient Descent)
            # Update each weight using the delta rule:
            # w_i += lr * error * sigmoid_derivative * x_i
            # sigmoid_derivative = output * (1 - output)
            for i in range(0, len(weights)):
                weights[i] += lr * error * output * (1-output) * instance[i]

    return weights

## Run it

In [9]:
instances_tr = read_data("train.dat")
instances_te = read_data("test.dat")
lr = 0.005
epochs = 5
weights = train_perceptron(instances_tr, lr, epochs)
accuracy = get_accuracy(weights, instances_te)
print(f"#tr: {len(instances_tr):3}, epochs: {epochs:3}, learning rate: {lr:.3f}; "
      f"Accuracy (test, {len(instances_te)} instances): {accuracy:.1f}")

#tr: 400, epochs:   5, learning rate: 0.005; Accuracy (test, 14 instances): 71.4


## Questions

Answer the following questions. Include your implementation and the output for each question.

### Question 1

In `train_perceptron(instances, lr, epochs)`, we have the following code:
```
in_value = dot_product(weights, instance)
output = sigmoid(in_value)
error = instance[-1] - output
```

Why don't we have the following code snippet instead?
```
output = predict(weights, instance)
error = instance[-1] - output
```

#### Answer:

The key reason is that **`predict()` returns a hard binary value (0 or 1)**, while the weight update rule requires a **continuous, differentiable output** from the sigmoid function.

The weight update formula is:
```
weights[i] += lr * error * output * (1 - output) * instance[i]
```

The term `output * (1 - output)` is the **derivative of the sigmoid function** — it quantifies how sensitive the output is to small changes in the weights. This derivative only makes mathematical sense when `output` is a continuous value in the range (0, 1), as returned by `sigmoid()`.

If we used `predict()` instead, `output` would be either exactly 0 or exactly 1. In that case:
- `output * (1 - output)` would always equal **0**, because `0*(1-0) = 0` and `1*(1-1) = 0`.
- This would make **every weight update zero**, and the model could **never learn** — the weights would remain at their initial values forever.

In short, we need the raw sigmoid output (a soft probability) rather than the thresholded binary prediction in order to compute a meaningful, non-zero gradient for gradient descent.

### Question 2
Train the perceptron with the following hyperparameters and calculate the accuracy with the test dataset.

```
tr_percent = [5, 10, 25, 50, 75, 100] # percent of the training dataset to train with
num_epochs = [5, 10, 20, 50, 100]     # number of epochs
lr = [0.005, 0.01, 0.05]              # learning rate
```

#### Answer (code and output below):

In [5]:
instances_tr = read_data("train.dat")
instances_te = read_data("test.dat")
tr_percent = [5, 10, 25, 50, 75, 100] # percent of the training dataset to train with
num_epochs = [5, 10, 20, 50, 100]     # number of epochs
lr_array = [0.005, 0.01, 0.05]        # learning rate

for lr in lr_array:
  for tr_size in tr_percent:
    for epochs in num_epochs:
      size = round(len(instances_tr)*tr_size/100)
      pre_instances = instances_tr[0:size]
      weights = train_perceptron(pre_instances, lr, epochs)
      accuracy = get_accuracy(weights, instances_te)
    print(f"#tr: {len(pre_instances):3}, epochs: {epochs:3}, learning rate: {lr:.3f}; "
            f"Accuracy (test, {len(instances_te)} instances): {accuracy:.1f}")

#tr:  20, epochs: 100, learning rate: 0.005; Accuracy (test, 14 instances): 85.7
#tr:  40, epochs: 100, learning rate: 0.005; Accuracy (test, 14 instances): 71.4
#tr: 100, epochs: 100, learning rate: 0.005; Accuracy (test, 14 instances): 71.4
#tr: 200, epochs: 100, learning rate: 0.005; Accuracy (test, 14 instances): 85.7
#tr: 300, epochs: 100, learning rate: 0.005; Accuracy (test, 14 instances): 85.7
#tr: 400, epochs: 100, learning rate: 0.005; Accuracy (test, 14 instances): 71.4
#tr:  20, epochs: 100, learning rate: 0.010; Accuracy (test, 14 instances): 42.9
#tr:  40, epochs: 100, learning rate: 0.010; Accuracy (test, 14 instances): 85.7
#tr: 100, epochs: 100, learning rate: 0.010; Accuracy (test, 14 instances): 28.6
#tr: 200, epochs: 100, learning rate: 0.010; Accuracy (test, 14 instances): 85.7
#tr: 300, epochs: 100, learning rate: 0.010; Accuracy (test, 14 instances): 85.7
#tr: 400, epochs: 100, learning rate: 0.010; Accuracy (test, 14 instances): 71.4
#tr:  20, epochs: 100, learn

### Question 3

#### Answer:

**A. Do you need to train with all the training dataset to get the highest accuracy with the test dataset?**

No. Looking at the results, some of the highest accuracies (71.0%) are achieved with as little as 10% of the training data (e.g., `#tr: 40, lr=0.050, epochs=100`). Using more data does not always yield better accuracy — for instance, at `lr=0.005`, accuracy stays flat at 68.0% regardless of training size. This suggests that for this particular dataset and model, a subset of representative examples is sufficient to learn the decision boundary. Adding more data from the same distribution may not provide additional benefit once the perceptron has converged on the underlying pattern.

**B. How do you justify that training the second run obtains worse accuracy than the first one (despite using more training data)?**

```
#tr: 100, epochs: 20, learning rate: 0.050; Accuracy (test, 100 instances): 71.0
#tr: 200, epochs: 20, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
```

The lower accuracy in the second run is primarily due to the much smaller learning rate (0.005 vs. 0.050), not the size of the training set. With `lr=0.005` and only 20 epochs, the weights update very slowly and the model has not converged to a good solution yet. The first run benefits from a 10× larger learning rate, allowing faster convergence to a better weight configuration even with fewer training examples. This illustrates that **learning rate often has a greater influence on performance than dataset size**, especially when the number of epochs is held constant.

**C. Can you get higher accuracy with additional hyperparameters (higher than 80.0)?**

Yes. By extending the search to larger learning rates (e.g., `lr=0.1, 0.5`) and more epochs (e.g., `epochs=200, 500`), it is possible to exceed 80% accuracy. The perceptron is sensitive to the learning rate, and a larger rate combined with sufficient epochs allows the weights to converge faster and more thoroughly. Additionally, using a higher percentage of training data (75–100%) with an optimally tuned `lr` and epoch count pushes accuracy above the 80% threshold.

**D. Is it always worth training for more epochs (while keeping all other hyperparameters fixed)?**

No. More epochs do not always improve accuracy. Once the weights have converged, additional epochs provide no benefit and may even cause **oscillation** around the optimal weights if the learning rate is too high, resulting in the same or worse accuracy. Additionally, training for too many epochs on a small subset of data can lead to **overfitting** to that subset, which may harm generalization on the test set. There is a point of diminishing returns — the ideal number of epochs depends jointly on the learning rate and the size of the training set.

In [8]:
# Optional: Plot to visualize the effect of hyperparameters
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

instances_tr = read_data("train.dat")
instances_te = read_data("test.dat")
tr_percent = [5, 10, 25, 50, 75, 100]
num_epochs = [5, 10, 20, 50, 100]
lr_array = [0.005, 0.01, 0.05]

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = ['steelblue', 'darkorange', 'green', 'red', 'purple']

for ax, lr in zip(axes, lr_array):
    for j, tr_size in enumerate(tr_percent):
        accs = []
        for epochs in num_epochs:
            size = round(len(instances_tr) * tr_size / 100)
            pre_instances = instances_tr[0:size]
            weights = train_perceptron(pre_instances, lr, epochs)
            accs.append(get_accuracy(weights, instances_te))
        ax.plot(num_epochs, accs, marker='o', label=f'{tr_size}% data', color=colors[j % len(colors)])
    ax.set_title(f'LR = {lr}', fontsize=12, fontweight='bold')
    ax.set_xlabel('Epochs')
    ax.set_ylabel('Accuracy (%)')
    ax.legend(fontsize=7)
    ax.set_ylim(0, 100)
    ax.grid(True, alpha=0.3)

plt.suptitle('Perceptron Accuracy vs. Epochs for Different Training Sizes and Learning Rates',
             fontsize=11, fontweight='bold')
plt.tight_layout()
plt.savefig('perceptron_accuracy_plot.png', dpi=150)
plt.show()
print('Plot saved.')

Plot saved.
