# Neural Network Basics


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-spring-main/blob/master/assignment/a1/NNBasics.ipynb)

### Brief Review of Machine Learning

In supervised learning, parametric models are those where the model is a function of a fixed form with a number of unknown _parameters_.  Together with a loss function and a training set, an optimizer can select parameters to minimize the loss with respect to the training set.  A common optimizer is stochastic gradient descent, which repeatedly tweaks the parameters slightly to move the loss "downhill" on a small batch of examples drawn from the training set.
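As a rough illustration of that loop, here is a minimal sketch of minibatch stochastic gradient descent fitting a one-feature affine model with a squared-error loss. The toy data, `learning_rate`, and `batch_size` are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 3x + 2 plus a little noise.
x = rng.normal(size=(200, 1))
y = 3.0 * x[:, 0] + 2.0 + 0.01 * rng.normal(size=200)

W = np.zeros(1)        # weight (one feature)
b = 0.0                # offset
learning_rate = 0.1    # assumed step size
batch_size = 20        # assumed batch size

for step in range(500):
    # Sample a small batch of examples from the training set.
    idx = rng.integers(0, len(x), size=batch_size)
    xb, yb = x[idx], y[idx]

    # Prediction error of the affine model on this batch.
    err = xb @ W + b - yb                    # shape (batch_size,)

    # Gradients of the mean squared error with respect to W and b.
    grad_W = 2.0 * xb.T @ err / batch_size   # shape (1,)
    grad_b = 2.0 * err.mean()

    # Take a small step "downhill".
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print("W:", W, "b:", b)
```

With these settings the fitted values should land close to the slope and offset used to generate the data.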

## Part A:  Linear & Logistic Regression

You've likely seen linear regression before.  In linear regression, we fit a line (technically, hyperplane) that predicts a target variable, $y$, based on some features $x$.  The form of this model is affine (even if we call it "linear"):  

$$\hat{y} = xW + b$$

where $W$ and $b$ are weights and an offset, respectively, and are the parameters of this parametric model.  The loss function that the optimizer uses to fit these parameters is the squared error ($\|\hat{y} - y\|_2^2$) between the prediction and the ground truth in the training set.

You've also likely seen logistic regression, which is closely related to linear regression.  Logistic regression also fits a line, this time one that separates the positive and negative examples of a binary classifier.  The form of this model is similar:

$$\hat{y} = \sigma(xW + b)$$

where again $W$ and $b$ are the parameters of this model, and $\sigma$ is the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) which maps un-normalized scores ("logits") to values $\hat{y} \in [0,1]$ that represent probabilities. The loss function that the optimizer uses to fit these parameters is the [cross entropy](../../materials/lesson_notebook/lesson_1_NN_Review.ipynb) between the prediction and the ground truth in the training set.
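To make the sigmoid and cross-entropy pieces concrete, here is a small sketch for a single example. The feature values, weights, and the helper name `binary_cross_entropy` are hypothetical, and the loss uses the natural log.

```python
import numpy as np

def sigmoid(z):
    """Map a logit to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_hat, y):
    """Cross-entropy (natural log) between prediction y_hat and label y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical 3-feature example and parameters.
x = np.array([0.5, -1.0, 2.0])
W = np.array([1.0, 0.5, -0.25])
b = 0.0

z = x @ W + b          # the logit: 0.5 - 0.5 - 0.5 = -0.5
y_hat = sigmoid(z)     # probability of the positive class

print("logit:", z)
print("probability:", y_hat)
print("loss if label is positive:", binary_cross_entropy(y_hat, 1))
print("loss if label is negative:", binary_cross_entropy(y_hat, 0))
```

Because the logit is negative here, the model leans toward the negative class, so the loss is larger when the true label is positive.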

This pattern of an affine transform, $xW + b$, occurs over and over in machine learning.

**We'll use logistic regression as our running example for the rest of this part.**


### Short Answer Questions

Imagine you want to implement logistic regression:

* `z = xW + b`
* `y_hat = sigmoid(z)`

Where:
1.  `x` is an 11-dimensional feature vector
2.  `W` is the weight vector
3.  `b` is the bias term

What are the dimensions of `W` and `b`?  Recall that in logistic regression, `z` is just a scalar (commonly referred to as the "logit").

Sketch a picture of the whole equation using rectangles to illustrate the dimensions of `x`, `W`, and `b`.  See examples below for inspiration (though please label each dimension).  We don't ask you to submit this, but make sure you can do it!  It's the "print" debugging statement of neural networks!  It's also useful for reading papers... if you can't draw the shapes of all the tensors, you don't (yet) know what's going on!

## Part B: Batching

Let's say we want to perform inference using your model (parameters `W` and `b`) above on multiple examples instead of just one. On modern hardware (especially GPUs), we can do this efficiently by *batching*.

To do this, we stack up the feature vectors in `x` like in the diagram below.  Note that changing the number of examples you run on (i.e. your batch size) *does not* affect the number of parameters in your model.  You're just running the same thing in parallel (instead of running the above one feature vector at a time).

![](https://github.com/datasci-w266/2025-spring-main/blob/master/assignment/a1/batchaffine.png?raw=1)

The red (# features) and blue (batch size) lines represent dimensions that are the same.
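One way to convince yourself that batching only changes how many examples run at once is to compare a batched computation with a per-example loop. The sketch below uses deliberately different, hypothetical sizes (4 features, 6 examples) and random values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, batch_size = 4, 6            # hypothetical sizes
W = rng.normal(size=n_features)          # parameters: independent of batch size
b = 0.5
x = rng.normal(size=(batch_size, n_features))   # one example per row

# Batched: a single matrix-vector product covers every example.
z_batched = x @ W + b

# Unbatched: the same affine transform, one feature vector at a time.
z_looped = np.array([row @ W + b for row in x])

print("x:", x.shape, "W:", W.shape, "z:", z_batched.shape)
print("same result:", np.allclose(z_batched, z_looped))
```

The scalar `b` is broadcast across the batch, which is why no extra parameters are needed as the batch grows.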

### Short Answer Questions

If we have 11 features and run the model in parallel on 30 examples, what are the dimensions of:

1. `W` ?
2. `b` ?
3. `x` ?
4. `z` ?

_Hint:_ remember that your model parameters stay fixed!

## Part C: Logistic Regression - NumPy Implementation

In this section, we'll implement logistic regression by hand and compute a few values to make sure we understand what's going on!

Let's say your model has the following parameters:

In [1]:
import numpy as np

W = np.array([45,6,3,25,-1])
b = 5

If you want to run the model on the following three examples:

* [1, 2, 3, 4, 5]
* [0, 0, 0, 0, 5]
* [-3, -4, -12, -1, 1]

Construct the x matrix **such that you compute the answer all in one big batch** and compute the probability of the positive class for each.

In [2]:
# Import sigmoid.
from scipy.special import expit as sigmoid

# Parameters
W = np.array([45, 6, 3, 25, -1])
b = 5

# Input examples stacked as rows in a matrix (batch)
x = np.array([
    [1, 2, 3, 4, 5],
    [0, 0, 0, 0, 5],
    [-3, -4, -12, -1, 1]
])

# Calculate logits (z = xW + b)
z = np.dot(x, W) + b

# Apply sigmoid to get probabilities
y_hat = sigmoid(z)

# Print results
print("Logits (z):", z)
print("Probabilities (y_hat):", y_hat)

### END YOUR CODE

Logits (z): [ 166    0 -216]
Probabilities (y_hat): [1.00000000e+00 5.00000000e-01 1.55737037e-94]


### Short Answer Questions

1. What is the probability of the positive class for the second (middle) example?
   Output: `y_hat[1]`, which is 0.5.

2. What is the cross-entropy loss in base 2 of the second example if its label is positive?
   Formula: $-\log_2(\hat{y})$ for the positive class.
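As a quick numerical check of the base-2 loss for a positive label, the sketch below plugs in the probability 0.5 printed for the second example above and also shows the natural-log value for comparison.

```python
import numpy as np

y_hat = 0.5                    # probability of the positive class for the second example

loss_bits = -np.log2(y_hat)    # cross-entropy in base 2 (bits)
loss_nats = -np.log(y_hat)     # same loss using the natural log (nats)

print("loss in bits:", loss_bits)
print("loss in nats:", loss_nats)
```

The two values differ only by a constant factor of $\ln 2$, so the choice of base rescales the loss without changing which parameters minimize it.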


## Part D: NumPy Feed Forward Neural Network

Let's do the same procedure for a simple feed-forward neural network.

Imagine you have a 3-layer network (hint: # of affines = # of layers; the affine is the $xW + b$ part of a layer).  Each hidden layer is size 10.  Just like before, you've already trained your model and you just want to run it forward.  For this exercise, let's say that each weight matrix is `np.ones(...)` and each bias term is `[-1, -2, -3, ..., -n]` if the bias term is $n$ long.  Compute the probability of the positive class for the three examples above, again in a single batch.

**Hint:  Draw the shapes of the matrices at each layer out on a piece of paper!  Include it with any questions you post to Ed Discussion.**

Assume your model uses a sigmoid as the nonlinearity for all layers.

In [3]:
# Input examples
x = np.array([
    [1, 2, 3, 4, 5],
    [0, 0, 0, 0, 5],
    [-3, -4, -12, -1, 1]
])

# Initialize parameters for the 3-layer network
# Layer 1
W1 = np.ones((5, 10))
b1 = np.array([-1] * 10)

# Layer 2
W2 = np.ones((10, 10))
b2 = np.array([-2] * 10)

# Layer 3
W3 = np.ones((10, 1))
b3 = np.array([-3])

# Forward pass
z1 = np.dot(x, W1) + b1
a1 = sigmoid(z1)

z2 = np.dot(a1, W2) + b2
a2 = sigmoid(z2)

z3 = np.dot(a2, W3) + b3
y_hat = sigmoid(z3)

# Print the final probabilities
print("Final Probabilities (y_hat):", y_hat)

### END YOUR CODE

Final Probabilities (y_hat): [[0.99908589]
 [0.99908529]
 [0.14088356]]
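The exercise text describes each bias term as `[-1, -2, ..., -n]`, whereas the cell above uses a single constant value per layer. For comparison, here is a minimal sketch of the same forward pass under the literal `[-1, ..., -n]` reading. The names `descending_bias` and `layer_sizes` are hypothetical, and the NumPy `sigmoid` below stands in for `scipy.special.expit`; expect the resulting probabilities to differ from the output printed above.

```python
import numpy as np

def sigmoid(z):
    """NumPy stand-in for scipy.special.expit."""
    return 1.0 / (1.0 + np.exp(-z))

def descending_bias(n):
    """Return the bias vector [-1, -2, ..., -n]."""
    return -np.arange(1, n + 1, dtype=float)

# The same three input examples, stacked as a batch.
x = np.array([
    [1, 2, 3, 4, 5],
    [0, 0, 0, 0, 5],
    [-3, -4, -12, -1, 1],
], dtype=float)

# 5 input features, two hidden layers of size 10, one output unit.
layer_sizes = [5, 10, 10, 1]

a = x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = np.ones((n_in, n_out))     # all-ones weight matrix
    b = descending_bias(n_out)     # bias of length n_out
    a = sigmoid(a @ W + b)         # affine transform followed by sigmoid

y_hat_spec = a                     # shape (3, 1): one probability per example
print("Shape:", y_hat_spec.shape)
print("Probabilities:", y_hat_spec.ravel())
```

Looping over `layer_sizes` keeps the shapes explicit: each weight matrix is `(n_in, n_out)` and each bias has length `n_out`, regardless of batch size.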


### Short Answer Questions

1.  What is the probability of the third example?
    `y_hat[2]`, approximately 0.141.

2.  What is the cross-entropy loss if its label is negative?
    $-\ln(1 - \hat{y}) \approx 0.152$


## Part E: Softmax

Recall that $\mathrm{softmax}(z)$ is a vector with the same length as $z$, and whose components are:  $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$.

### Short Answer Questions

1. If the logits coming from the main body of the network are [4, 6, 8], what is the probability of the middle class? Answer: 0.117
2. What is the cross-entropy loss if the correct class is the last one (i.e. corresponding to logit=8)? Answer: 0.143
3. If you had such a three-class classification problem, what would the dimensions of W and b be for the last layer of the feed forward neural network above?
   * `W`: (hidden_layer_size, 3), where hidden_layer_size is the number of neurons in the previous layer.
   * `b`: (3,), one bias for each class.
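To see those last-layer shapes in a batched setting, here is a small sketch with a three-class output layer. The hidden activations `a2` and the bias values are hypothetical placeholders chosen only to make the shapes and arithmetic easy to follow.

```python
import numpy as np

batch_size, hidden_size, n_classes = 3, 10, 3

# Hypothetical hidden-layer activations: one row per example.
a2 = np.full((batch_size, hidden_size), 0.5)

# Last-layer parameters for a three-class problem.
W3 = np.ones((hidden_size, n_classes))    # shape (10, 3)
b3 = np.array([-1.0, -2.0, -3.0])         # shape (3,), one bias per class

# Each row of logits is 10 * 0.5 + b3 = [4, 3, 2].
logits = a2 @ W3 + b3                     # shape (3, 3)

# Softmax along the class axis, so each row becomes a distribution.
exp_logits = np.exp(logits)
probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

print("logits shape:", logits.shape)
print("row sums:", probs.sum(axis=1))
```

Using `axis=1` with `keepdims=True` normalizes each example separately, which is what keeps the batch dimension independent of the class dimension.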

In [4]:
# Logits
logits = np.array([4, 6, 8])

# Compute softmax
softmax = np.exp(logits) / np.sum(np.exp(logits))
print("Softmax Probabilities:", softmax)

# Probability of the middle class
print("Probability of the middle class:", softmax[1])

# Cross-entropy loss for the last class (logit=8)
loss = -np.log(softmax[2])
print("Cross-Entropy Loss for correct class:", loss)


Softmax Probabilities: [0.01587624 0.11731043 0.86681333]
Probability of the middle class: 0.11731042782619837
Cross-Entropy Loss for correct class: 0.1429316284998995
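The direct formula works for small logits like these, but `np.exp` can overflow when logits are large. A common variant, sketched here under the hypothetical name `stable_softmax`, subtracts the maximum logit first; this leaves the result unchanged because softmax is invariant to adding a constant to every logit.

```python
import numpy as np

def stable_softmax(z):
    """Softmax of a 1-D array, shifted by max(z) to avoid overflow in exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()          # largest entry becomes 0, so exp never exceeds 1
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

small = stable_softmax([4, 6, 8])
large = stable_softmax([1000, 1002, 1004])   # the direct formula would overflow here

print("Logits [4, 6, 8]:         ", small)
print("Logits [1000, 1002, 1004]:", large)
```

Both calls should print the same probabilities, since the second set of logits is just the first shifted by a constant.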
