Before introducing PyTorch, we will first implement a neural network using **NumPy**. 

NumPy provides a powerful **n-dimensional array object** and many functions for manipulating these arrays. However, it knows nothing about **computation graphs**, **deep learning**, or **gradients**. Even so, we can use NumPy to fit a basic two-layer network to random data by coding the forward and backward passes by hand:

```python
# Code: two_layer_net_numpy.py
import numpy as np

# Define network parameters
N, D_in, H, D_out = 64, 1000, 100, 10  # N: batch size, D_in: input dim, H: hidden dim, D_out: output dim

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass to compute gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```
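Because the backward pass above is derived by hand, it is worth checking against a numerical estimate. The following is a minimal sketch, assuming much smaller dimensions than the example and a hypothetical helper `numeric_grad` (neither is part of the original script); it compares the analytic gradients with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 6, 5, 3  # small sizes (an assumption) keep the check fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn():
    # Same forward pass and sum-of-squares loss as the training loop
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, written exactly as in the training loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(f, w, eps=1e-6):
    # Hypothetical helper: central differences, perturbing one entry at a time
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old  # restore the original value
        g[idx] = (plus - minus) / (2 * eps)
    return g

max_err_w1 = np.abs(numeric_grad(loss_fn, w1) - grad_w1).max()
max_err_w2 = np.abs(numeric_grad(loss_fn, w2) - grad_w2).max()
print("max abs error:", max_err_w1, max_err_w2)
```

If the two errors are tiny relative to the gradient magnitudes, the hand-written backward pass agrees with the numerical estimate.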

### PyTorch: Tensors

**NumPy** cannot run computations on a GPU, and GPU acceleration is critical for deep learning. **PyTorch** fills this gap with **Tensors**, which behave much like NumPy arrays but can also live on a GPU.
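To make the similarity concrete, here is a small sketch of moving data between a NumPy array and a Tensor. The array contents and variable names are illustrative assumptions; the code falls back to the CPU when no GPU is present.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)  # CPU tensor that shares memory with `a`

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)

# Same matrix product as a @ a.T, computed on `device`, then copied back
result = (t_dev @ t_dev.T).cpu().numpy()
print(result)

# Writes to the NumPy array are visible through the CPU tensor
a[0, 0] = 10.0
shared = t[0, 0].item()
print(shared)
```

`torch.from_numpy` avoids a copy on the CPU, while `.to(device)` copies the data whenever the target device differs.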

The version below keeps every tensor on a single `device` (the GPU when one is available) and relies on autograd for the backward pass:

```python
# Code: two_layer_net_pytorch.py
import torch

# Set device (CUDA for GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Network parameters
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Initialize weights with requires_grad=True to enable autograd
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    # Forward pass
    h = x.mm(w1)
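    h_relu = torch.relu(h)  # Built-in ReLU
    y_pred = h_relu.mm(w2)

    # Compute loss; reduction="sum" matches the NumPy version above
    # (the default "mean" divides by N * D_out, which shrinks the gradients)
    loss = torch.nn.functional.mse_loss(y_pred, y, reduction="sum")
    print(t, loss.item())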

    # Backward pass (automatic with PyTorch autograd)
    loss.backward()

    # Update weights using torch.no_grad() to avoid tracking this in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after each step
        w1.grad.zero_()
        w2.grad.zero_()
```

### Features Used:

1. **Device management**: The code selects the GPU when one is available and falls back to the CPU otherwise (`torch.device("cuda" if torch.cuda.is_available() else "cpu")`).
2. **Built-in operations**: ReLU (`torch.relu`) and MSE Loss (`torch.nn.functional.mse_loss`) simplify the implementation.
3. **Autograd**: `loss.backward()` computes the gradients of `w1` and `w2` automatically, so the backward pass no longer has to be derived and coded by hand as in the NumPy version.
4. **Untracked updates**: `torch.no_grad()` keeps the weight update out of the autograd graph. Without it, the in-place update of a leaf tensor that requires gradients raises an error.
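The roles of `zero_()` and `torch.no_grad()` in the training loop can be seen on a tiny tensor. This is an illustrative sketch with made-up values, not part of the original script: gradients accumulate across `backward()` calls until they are zeroed, and an in-place update outside `no_grad()` is rejected.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # d/dw sum(w^2) = 2w

(w ** 2).sum().backward()     # gradients were not zeroed in between
accumulated = w.grad.clone()  # the second gradient is added to the first

w.grad.zero_()
(w ** 2).sum().backward()
after_zero = w.grad.clone()   # back to a single gradient

# An in-place update of a leaf that requires grad is only allowed
# when autograd tracking is switched off.
try:
    w -= 0.1 * w.grad
    update_failed = False
except RuntimeError:
    update_failed = True

with torch.no_grad():
    w -= 0.1 * w.grad         # accepted: not recorded in the graph

print(first, accumulated, after_zero, update_failed, w)
```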

### PyTorch: Automatic Mixed Precision (AMP)
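**Automatic Mixed Precision (AMP)** speeds up training on supported GPUs by running selected operations in FP16 while keeping the rest in FP32. Because small FP16 gradients can underflow to zero, AMP is normally paired with a gradient scaler that multiplies the loss before the backward pass and unscales the gradients before the update. Here's how you can add AMP: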

```python
# Code: two_layer_net_amp.py
import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

scaler = GradScaler()  # Scales the loss so FP16 gradients do not underflow
learning_rate = 1e-6
# scaler.step() expects an optimizer, so wrap the raw weight tensors in SGD
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)

for t in range(500):
    # Forward pass under autocast for mixed precision
    with autocast():
        h = x.mm(w1)
        h_relu = torch.relu(h)
        y_pred = h_relu.mm(w2)
        loss = torch.nn.functional.mse_loss(y_pred, y)

    print(t, loss.item())

    # Backward pass on the scaled loss
    optimizer.zero_grad()
    scaler.scale(loss).backward()

    # Unscale the gradients and apply the SGD step (skipped if they contain
    # inf/NaN), then adjust the scale factor for the next iteration
    scaler.step(optimizer)
    scaler.update()
```

This version uses **autocast** to choose a precision for each operation and **GradScaler** to scale the loss so that small FP16 gradients do not underflow. The speedup applies on GPUs with FP16 support.
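The effect of autocast on individual operations can be observed directly. The sketch below is an illustration under a stated assumption: it uses CPU autocast with `bfloat16` so that it also runs without a GPU, whereas the CUDA example above typically uses FP16. The tensor shapes are arbitrary.

```python
import torch

a = torch.randn(8, 16)  # float32 inputs
b = torch.randn(16, 4)

# Assumption for illustration: CPU autocast with bfloat16, chosen so the
# sketch does not require a GPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    inside = a.mm(b)    # matrix multiply is run in reduced precision

outside = a.mm(b)       # outside the context: ordinary float32

print(inside.dtype, outside.dtype)
```

The inputs themselves stay in FP32; only the operations executed inside the context are cast.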

Together, these features give PyTorch an efficient and scalable way to build deep learning models: GPU execution, mixed precision, and built-in losses and operations.

Custom autograd functions let you define your own operations, with explicit forward and backward passes, for finer control over the computational graph. The sections below cover how to write one, how PyTorch's dynamic graphs contrast with TensorFlow's static graphs, and how to use the `nn` package and optimizers.

### PyTorch: Defining Custom Autograd Functions
Under the hood, each primitive operator in autograd consists of two parts: the forward function, which computes the output tensors from input tensors, and the backward function, which computes gradients of input tensors based on the gradients of output tensors.

To define a custom autograd operator in PyTorch, you can subclass `torch.autograd.Function` and implement both the forward and backward methods. Once the custom function is created, you can use it just like any other PyTorch operation.

#### Example: Custom ReLU Autograd Function
```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Cache the input tensor for the backward pass
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Retrieve the cached tensor from the context
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

# Define input, output dimensions, and random data
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Use custom ReLU in forward pass
    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
    
    # Compute loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Perform backward pass
    loss.backward()
    
    # Update weights
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
```
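A custom backward method is easy to get wrong, so it can help to compare it with numerical gradients. The following is a sketch using `torch.autograd.gradcheck`; the input shape, the seed, and the shift away from zero (where ReLU is not differentiable) are assumptions made for the check.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

torch.manual_seed(0)
# gradcheck needs double precision; keep every entry at least 0.1 away
# from the kink at zero so finite differences are well defined.
x = torch.randn(4, 5, dtype=torch.double)
x = torch.where(x >= 0, x + 0.1, x - 0.1)
x.requires_grad_()

ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", ok)
```

`gradcheck` raises an error describing the mismatch when the analytic and numerical Jacobians disagree, which makes it a convenient unit test for new `Function` subclasses.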

### TensorFlow: Static Graphs vs PyTorch's Dynamic Graphs
TensorFlow 1.x uses a static computation graph: the graph is defined once, can be optimized before execution, and is then run repeatedly. (TensorFlow 2.x executes eagerly by default and builds graphs only through `tf.function`.) In contrast, PyTorch builds its computation graph on the fly during each forward pass.

Static graphs can be optimized upfront, which is efficient when the same graph runs many times. Dynamic graphs are more flexible, particularly for models whose computation varies per input (e.g., recurrent networks with different sequence lengths).
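The flexibility of dynamic graphs can be sketched in a few lines of PyTorch. In this illustrative example (the function `repeated_scale` and the values are hypothetical), an ordinary Python loop decides how many operations are recorded, and autograd differentiates whichever graph was built on that call.

```python
import torch

def repeated_scale(x, w, n_steps):
    # The loop length is plain Python, so the graph differs per call
    y = x
    for _ in range(n_steps):
        y = y * w
    return y

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

grads = {}
for n_steps in (1, 3):
    w.grad = None                      # start from a clean gradient
    y = repeated_scale(x, w, n_steps)  # y = x * w**n_steps
    y.backward()                       # dy/dw = n_steps * x * w**(n_steps - 1)
    grads[n_steps] = w.grad.item()

print(grads)
```

With `w = 2` and `x = 3`, the gradient is `3` for one step and `3 * 3 * 2**2 = 36` for three steps, without any graph being declared in advance.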

#### Example: TensorFlow Two-Layer Network
```python
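# This example uses the TensorFlow 1.x graph API; on TensorFlow 2.x it is
# available through the compat.v1 module with v2 behavior disabled.
import tensorflow.compat.v1 as tf
import numpy as np

tf.disable_v2_behavior()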

# Input, output dimensions and placeholders
N, D_in, H, D_out = 64, 1000, 100, 10
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Weight variables
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Loss, gradients, and gradient-descent update ops
loss = tf.reduce_sum((y - y_pred) ** 2)
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Execute the graph
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)
```

### PyTorch: Using `nn` for Layer Abstractions
The `nn` package in PyTorch provides higher-level abstractions for neural networks, allowing users to define models from pre-built layers and loss functions.

#### Example: Two-Layer Network using `nn`
```python
import torch

# Input, output dimensions
N, D_in, H, D_out = 64, 1000, 100, 10

# Random input and target data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model using nn.Sequential
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Loss function
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients
    model.zero_grad()
    
    # Backward pass
    loss.backward()

    # Update weights
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
```
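Each `nn.Linear` layer owns its weight and bias tensors, which is why the loop above can update everything through `model.parameters()`. As a small sketch, the same model can be inspected to see which tensors it holds:

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Names are "<index in Sequential>.<parameter>"; ReLU has no parameters
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
n_params = sum(p.numel() for p in model.parameters())

print(shapes)
print(n_params)  # 100*1000 + 100 + 10*100 + 10 = 101110
```

Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, the transpose of the `w1` and `w2` layout used in the earlier examples.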

### PyTorch: Optimizers with `optim`
Instead of manually updating weights, PyTorch's `optim` package provides abstractions for common optimization algorithms like SGD, Adam, RMSProp, etc.

#### Example: Using Adam Optimizer
```python
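import torch

# Dimensions and random data (same as in the previous example)
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
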
# Define model and loss function
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Define optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Zero gradients, backward pass, and update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
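To see what an optimizer replaces, the sketch below compares one hand-written gradient-descent update with one `torch.optim.SGD` step on the same toy problem. The tensors and learning rate are illustrative assumptions; plain SGD is used here instead of Adam because its update rule is simple enough to reproduce by hand.

```python
import torch

lr = 0.1
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)

# Manual update: w <- w - lr * grad
(w_manual ** 2).sum().backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same update through torch.optim.SGD (no momentum, no weight decay)
optimizer = torch.optim.SGD([w_optim], lr=lr)
optimizer.zero_grad()
(w_optim ** 2).sum().backward()
optimizer.step()

print(w_manual.detach(), w_optim.detach())
```

Both tensors end at `[0.8, -1.6]`. Optimizers such as Adam keep additional per-parameter state, which is where the abstraction saves the most manual bookkeeping.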

This demonstrates how PyTorch's flexibility in dynamic graphs and its ease of use with `nn` and `optim` packages allow for powerful neural network modeling and optimization.

Output of `two_layer_net_numpy.py` (iteration, loss):

0 37338025.02300994
1 36577601.12595252
2 36612852.627226725
3 31416391.37271483
4 21672867.698657844
5 12078647.382830806
6 6120143.2445837455
7 3227548.8860854744
8 1942409.0602344465
9 1340734.2748368161
10 1018983.8511665803
11 818501.7659252007
12 677312.4715301059
13 569988.9290304305
14 484695.2676211521
15 415416.4135925022
16 358200.31577261654
17 310443.2739356288
18 270330.1024791559
19 236408.14543402605
20 207499.9935934868
21 182734.8407344056
22 161416.89130377333
23 142983.1814504591
24 126993.11375855764
25 113076.81753893393
26 100918.8968384983
27 90266.72630817394
28 80906.02480269548
29 72658.3331714608
30 65371.61703061804
31 58895.99662926106
32 53150.023135610434
33 48046.80486279775
34 43496.858353863994
35 39431.66462748908
36 35794.92604247674
37 32533.62496061922
38 29604.40284832956
39 26970.884536435264
40 24597.360158995707
41 22457.63785342501
42 20525.32527871606
43 18777.59615791388
44 17194.76744142731
45 15759.454538449638
46 14457.819504134064
47 13

In [2]:
# Code: two_layer_net_pytorch.py
import torch

# Set device (CUDA for GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Network parameters
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Initialize weights with requires_grad=True to enable autograd
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    # Forward pass
    h = x.mm(w1)
    h_relu = torch.relu(h)  # ReLU is now directly supported
    y_pred = h_relu.mm(w2)

    # Compute loss
    loss = torch.nn.functional.mse_loss(y_pred, y)  # Built-in MSE loss
    print(t, loss.item())

    # Backward pass (automatic with PyTorch autograd)
    loss.backward()

    # Update weights using torch.no_grad() to avoid tracking this in autograd
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after each step
        w1.grad.zero_()
        w2.grad.zero_()


0 60853.15625
1 60632.0390625
2 60412.3125
3 60194.0
4 59977.07421875
5 59761.51171875
6 59547.32421875
7 59334.5
8 59123.02734375
9 58912.8828125
10 58704.0703125
11 58496.57421875
12 58290.38671875
13 58085.5
14 57881.9140625
15 57679.6015625
16 57478.5625
17 57278.7890625
18 57080.27734375
19 56883.0
20 56686.96484375
21 56492.1640625
22 56298.5703125
23 56106.19921875
24 55915.0390625
25 55725.0625
26 55536.27734375
27 55348.6640625
28 55162.21875
29 54976.9375
30 54792.8125
31 54609.8203125
32 54427.96875
33 54247.24609375
34 54067.63671875
35 53889.15234375
36 53711.76171875
37 53535.4765625
38 53360.26953125
39 53186.14453125
40 53013.09375
41 52841.11328125
42 52670.1875
43 52500.3125
44 52331.48046875
45 52163.69140625
46 51996.92578125
47 51831.18359375
48 51666.44921875
49 51502.73046875
50 51340.0078125
51 51178.2734375
52 51017.52734375
53 50857.765625
54 50698.97265625
55 50541.15234375
56 50384.28125
57 50228.36328125
58 50073.40234375
59 49919.37109375
60 49766.26953125

In [4]:
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights
# of the model for us. Here we will use Adam optimizer.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all learnable parameters
    loss.backward()

    # Update the parameters using the optimizer
    optimizer.step()


0 651.315185546875
1 635.20068359375
2 619.5158081054688
3 604.2950439453125
4 589.5348510742188
5 575.2011108398438
6 561.2222900390625
7 547.69189453125
8 534.544921875
9 521.8331909179688
10 509.4169616699219
11 497.3133239746094
12 485.4803771972656
13 473.89947509765625
14 462.5671691894531
15 451.5332336425781
16 440.76434326171875
17 430.3234558105469
18 420.21759033203125
19 410.3592529296875
20 400.739501953125
21 391.3398742675781
22 382.22027587890625
23 373.3459777832031
24 364.6662902832031
25 356.2093505859375
26 347.9697265625
27 339.9106140136719
28 332.00531005859375
29 324.2637939453125
30 316.7051086425781
31 309.3155517578125
32 302.0869445800781
33 295.03125
34 288.1441345214844
35 281.4339294433594
36 274.888916015625
37 268.470458984375
38 262.17535400390625
39 256.0076904296875
40 249.972900390625
41 244.0723419189453
42 238.29501342773438
43 232.60977172851562
44 227.01943969726562
45 221.54942321777344
46 216.1769256591797
47 210.9066925048828
48 205.761657714

In [5]:
# Code: two_layer_net_amp.py
import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

scaler = GradScaler()  # For automatic mixed precision
learning_rate = 1e-6

for t in range(500):
    # Forward pass under autocast for mixed precision
    with autocast():
        h = x.mm(w1)
        h_relu = torch.relu(h)
        y_pred = h_relu.mm(w2)
        loss = torch.nn.functional.mse_loss(y_pred, y)

    print(t, loss.item())

    # Backward pass with scaler
    scaler.scale(loss).backward()

    # Update weights using scaler and no_grad
    with torch.no_grad():
        scaler.step(optimizer)
        scaler.update()

        w1.grad.zero_()
        w2.grad.zero_()


0 50592.66796875
1 50592.66796875
2 50592.66796875
3 50592.66796875
4 50592.66796875
5 50592.66796875
6 50592.66796875
7 50592.66796875
8 50592.66796875
9 50592.66796875
10 50592.66796875
11 50592.66796875
12 50592.66796875
13 50592.66796875
14 50592.66796875
15 50592.66796875
16 50592.66796875
17 50592.66796875
18 50592.66796875
19 50592.66796875
20 50592.66796875
21 50592.66796875
22 50592.66796875
23 50592.66796875
24 50592.66796875
25 50592.66796875
26 50592.66796875
27 50592.66796875
28 50592.66796875
29 50592.66796875
30 50592.66796875
31 50592.66796875
32 50592.66796875
33 50592.66796875
34 50592.66796875
35 50592.66796875
36 50592.66796875
37 50592.66796875
38 50592.66796875
39 50592.66796875
40 50592.66796875
41 50592.66796875
42 50592.66796875
43 50592.66796875
44 50592.66796875
45 50592.66796875
46 50592.66796875
47 50592.66796875
48 50592.66796875
49 50592.66796875
50 50592.66796875
51 50592.66796875
52 50592.66796875
53 50592.66796875
54 50592.66796875
55 50592.66796875
56

Output of the custom `MyReLU` example (iteration, loss):

0 28674472.0
1 22100178.0
2 20705210.0
3 21082048.0
4 21247376.0
5 19573704.0
6 15988871.0
7 11441961.0
8 7432249.0
9 4562696.5
10 2798877.5
11 1780412.875
12 1205988.625
13 873037.0
14 670717.375
15 539382.5625
16 448152.0625
17 380657.1875
18 328104.78125
19 285656.875
20 250466.515625
21 220776.46875
22 195410.921875
23 173550.125
24 154597.671875
25 138080.875
26 123620.78125
27 110937.0078125
28 99758.375
29 89883.25
30 81132.859375
31 73369.5078125
32 66451.15625
33 60271.5546875
34 54742.96875
35 49790.625
36 45342.55078125
37 41351.5703125
38 37757.26953125
39 34514.08203125
40 31582.283203125
41 28928.29296875
42 26522.947265625
43 24339.8515625
44 22356.3046875
45 20553.705078125
46 18913.765625
47 17418.302734375
48 16052.8671875
49 14808.05078125
50 13671.322265625
51 12629.201171875
52 11674.2802734375
53 10798.2861328125
54 9994.0341796875
55 9255.259765625
56 8575.712890625
57 7950.86669921875
58 7375.19091796875
59 6844.8642578125
60 6355.640625
61 5904.31982421875
62 5

In [8]:
import tensorflow as tf

# Input, output dimensions and placeholders
N, D_in, H, D_out = 64, 1000, 100, 10

# Input data and labels (using tf.random to simulate data)
x = tf.random.normal((N, D_in))
y = tf.random.normal((N, D_out))

# Define the model using the Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation='relu'),
    tf.keras.layers.Dense(D_out)
])

# Define a loss function
loss_fn = tf.keras.losses.MeanSquaredError()

# Define an optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# Training loop
for t in range(500):
    with tf.GradientTape() as tape:
        # Forward pass
        y_pred = model(x)
        # Compute the loss
        loss = loss_fn(y, y_pred)
    
    # Compute gradients
    gradients = tape.gradient(loss, model.trainable_variables)
    # Update weights
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    if t % 10 == 0:
        print(f"Step {t}, Loss: {loss.numpy()}")


Step 0, Loss: 2.5016636848449707
Step 10, Loss: 1.5964473485946655
Step 20, Loss: 0.9760836958885193
Step 30, Loss: 0.5803492665290833
Step 40, Loss: 0.3403693437576294
Step 50, Loss: 0.19694332778453827
Step 60, Loss: 0.11284787952899933
Step 70, Loss: 0.06403757631778717
Step 80, Loss: 0.036198221147060394
Step 90, Loss: 0.02056443877518177
Step 100, Loss: 0.011755162850022316
Step 110, Loss: 0.006751114968210459
Step 120, Loss: 0.003943950869143009
Step 130, Loss: 0.002344132401049137
Step 140, Loss: 0.0014119803672656417
Step 150, Loss: 0.0008625935297459364
Step 160, Loss: 0.0005376944318413734
Step 170, Loss: 0.0003402612928766757
Step 180, Loss: 0.0002157592389266938
Step 190, Loss: 0.0001359271991532296
Step 200, Loss: 8.459204400423914e-05
Step 210, Loss: 5.1843453547917306e-05
Step 220, Loss: 3.131864286842756e-05
Step 230, Loss: 1.860739939729683e-05
Step 240, Loss: 1.085907297238009e-05
Step 250, Loss: 6.224881417438155e-06
Step 260, Loss: 3.503816515149083e-06
Step 270, Lo

In [11]:
import tensorflow as tf

# Define input, hidden, and output dimensions
N, D_in, H, D_out = 64, 1000, 100, 10

# Random input and target data
x = tf.random.normal((N, D_in))
y = tf.random.normal((N, D_out))

# Define a simple feedforward model using Keras Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Dense(H, activation='relu'),  # Hidden layer with ReLU activation
    tf.keras.layers.Dense(D_out)                  # Output layer
])

# Define loss function (mean squared error) and optimizer (Adam)
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# Training loop
for t in range(500):
    with tf.GradientTape() as tape:
        # Forward pass: compute predicted y
        y_pred = model(x)
        # Compute the loss
        loss = loss_fn(y, y_pred)
    
    # Compute gradients of loss with respect to model variables
    gradients = tape.gradient(loss, model.trainable_variables)
    
    # Apply gradients to update the model's weights
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    if t % 10 == 0:
        print(f"Step {t}, Loss: {loss.numpy()}")


Step 0, Loss: 2.4008524417877197
Step 10, Loss: 1.5306938886642456
Step 20, Loss: 0.9340640306472778
Step 30, Loss: 0.5536028146743774
Step 40, Loss: 0.32213568687438965
Step 50, Loss: 0.18477827310562134
Step 60, Loss: 0.10479460656642914
Step 70, Loss: 0.059141602367162704
Step 80, Loss: 0.03317642584443092
Step 90, Loss: 0.01850474253296852
Step 100, Loss: 0.010273261927068233
Step 110, Loss: 0.005659168586134911
Step 120, Loss: 0.0030865618027746677
Step 130, Loss: 0.001668134587816894
Step 140, Loss: 0.0008960291161201894
Step 150, Loss: 0.00047824953799135983
Step 160, Loss: 0.00025412969989702106
Step 170, Loss: 0.00013539525389205664
Step 180, Loss: 7.483353692805395e-05
Step 190, Loss: 4.315172554925084e-05
Step 200, Loss: 2.574067548266612e-05
Step 210, Loss: 1.5804562281118706e-05
Step 220, Loss: 9.929213774739765e-06
Step 230, Loss: 6.341164407785982e-06
Step 240, Loss: 4.091427854291396e-06
Step 250, Loss: 2.6497045837459154e-06
Step 260, Loss: 1.7140461068265722e-06
Step 