From 0bd33d1f92147c93849d10a97c29e9b66ac64d48 Mon Sep 17 00:00:00 2001
From: bclarkson-code <57139598+bclarkson-code@users.noreply.github.com>
Date: Sat, 10 Feb 2024 12:47:01 +0000
Subject: [PATCH 1/3] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index b5aa76a..cbdef6a 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ Here are some things you can do with Tricycle:
Here are some things you can't do with Tricycle (yet):
- Do anything at the speed of pytorch
-- Use any built in layers, optimisers, regularisation techniques etc
+- Perform more advanced operations like Attention
- Use a GPU
If you want to do these things, you should check out [pytorch](https://pytorch.org/)
From 1d808bd3036b9bd31b34e8c4a7888273ca2ec21d Mon Sep 17 00:00:00 2001
From: bclarkson-code <57139598+bclarkson-code@users.noreply.github.com>
Date: Sat, 10 Feb 2024 13:00:31 +0000
Subject: [PATCH 2/3] Update README.md
---
README.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/README.md b/README.md
index cbdef6a..9ec0dad 100644
--- a/README.md
+++ b/README.md
@@ -40,3 +40,5 @@ Tricycle is tested using [pytest](https://docs.pytest.org/en/latest/)
poetry run pytest
```
+## Contact
+To get in touch, you can send an email to: [bclarkson-code@proton.me](mailto:bclarkson-code@proton.me)
From fc82237e3a5fd488aa3853be889f67d47c159209 Mon Sep 17 00:00:00 2001
From: bclarkson-code <57139598+bclarkson-code@users.noreply.github.com>
Date: Sat, 10 Feb 2024 14:29:48 +0000
Subject: [PATCH 3/3] Blog post 1 (#10)
* Started workm on blog post
* Finished first 2 chapters
* Added more to blog
* Completed graphical derivatives
* Finished first draft of content
* Updated blog post
* Added nicer pictures
* Updated images in explanation
* Updted post
* Finished all but the intro
* Added tech tree
* Added intorduction
---
blog_post_1.ipynb | 1597 ++++++++++++++++++++++++++++
images/square_error.png | Bin 0 -> 122012 bytes
images/tech_tree_post_1.png | Bin 0 -> 339174 bytes
images/variable_and_box.png | Bin 0 -> 38069 bytes
images/y_eq_mx_plus_p_deriv.png | Bin 0 -> 154834 bytes
images/y_eq_mx_plus_p_labelled.png | Bin 0 -> 53802 bytes
images/z_eq_mx.png | Bin 0 -> 57343 bytes
images/z_eq_xx_plus_xc.png | Bin 0 -> 85621 bytes
8 files changed, 1597 insertions(+)
create mode 100644 blog_post_1.ipynb
create mode 100644 images/square_error.png
create mode 100644 images/tech_tree_post_1.png
create mode 100644 images/variable_and_box.png
create mode 100644 images/y_eq_mx_plus_p_deriv.png
create mode 100644 images/y_eq_mx_plus_p_labelled.png
create mode 100644 images/z_eq_mx.png
create mode 100644 images/z_eq_xx_plus_xc.png
diff --git a/blog_post_1.ipynb b/blog_post_1.ipynb
new file mode 100644
index 0000000..20f5049
--- /dev/null
+++ b/blog_post_1.ipynb
@@ -0,0 +1,1597 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "b9938fb2-6080-4c89-bcac-bf815d78f1bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import Any, Optional\n",
+ "\n",
+ "import networkx as nx"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0009d17-27f8-48f8-a099-fbec7c5157e5",
+ "metadata": {},
+ "source": [
+ "# Building an LLM from scratch\n",
+ "## Automatic Differentiation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50d2a732-5cc2-42c8-9426-c0bf135e35c4",
+ "metadata": {},
+ "source": [
+ "Approach 1: I have some holes that need filling\n",
+ "When trying to follow the latest research, I sometimes get the impression that I'm missing something, knowing the name of a technique and vaguely how it works but not *actually* understanding it (Feynman, 1995). It feels like there is an insurmountable pile of research, with each development depending on all of the previous developments. Of course, it doesn't help that information is shared through a combination of dense scientific papers and a deluge of posts on Twitter/X, which makes finding the signal amongst the noise very challenging.\n",
+ "\n",
+ "When I've been confused about something technical in the past, one of the things that has helped me is getting my hands dirty and doing it myself. With maths at university, this was trying to teach my classmates. Here, I can't think of a better method than building a modern language model myself.\n",
+ "\n",
+ "Nowadays, it is \"relatively\" simple to copy-paste some code from huggingface, feed in some data and get a language model training. The whole process can be done in less than an hour and is a fantastic example of giving the public access to the latest research. Unfortunately, this ease of use is only possible by hiding a lot of important details which leaves people (me) with the feeling that there are gaps in their knowledge.\n",
+ "\n",
+ "Approach 2: Deep learning is done top down. I want to do bottom up\n",
+ "One of the best things about the AI community is how open and accessible (with some notable exceptions, I'm looking at you \"Open\"AI) modern developments are. With a bit of python knowledge, you can go to huggingface, copy-paste some code, feed in some data and start training a state of the art language model. The whole process can be done in less than an hour and is a fantastic example of giving the public access to the latest research. Unfortunately, this ease of use is only possible by hiding a lot of important details which leaves people (me) with the feeling that there might be gaps in their knowledge.\n",
+ "\n",
+ "I like to imagine deep learning research as a tree, with each new technique sprouting from the branches of earlier developments. To understand something new, you can read through the paper and every time you come across something you don't understand, you can follow the tree back a few branches to get the necessary background and then continue with the paper.\n",
+ "\n",
+ "I think that a lot of people who get into the field learn the basic theory (e.g. gradient descent) and then jump to the cutting edge, filling in any gaps by following the branches backwards as they go. This is a great way to do useful things as fast as possible but also requires a lot of work to fill in the gaps every time new research comes out. Given the pace of development, it is easy to feel overwhelmed with everything that is going on.\n",
+ "\n",
+ "To fix this issue, I want to start from the bottom of the tree and work up, building a modern language model with all the bells and whistles completely from scratch. Borrowing (shamelessly stealing) from computer games, I've built a tech tree of everything that I think I'll need to implement to get a fully functional language model. If you think anything is missing, please let me [know](mailto:bclarkson-code@proton.me): \n",
+ "\n",
+ "*[Figure: tech tree of everything needed to build a language model]*\n",
+ "\n",
+ "Before we can move onto building modern features like Rotary Positional Encodings, we first need to figure out how to differentiate with a computer. The backpropagation algorithm that underpins the entire field of Deep Learning requires the ability to differentiate the outputs of neural networks with respect to (wrt) their inputs. In this post, we'll go from nothing to an (admittedly very limited) automatic differentiation library that can differentiate arbitrary functions of scalar values.\n",
+ "\n",
+ "This one algorithm will form the core of our deep learning library that, eventually, will include everything we need to train a language model. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "477924ef-6dfc-43c2-881a-2f69167e1920",
+ "metadata": {},
+ "source": [
+ "## Creating a tensor \n",
+ "We can't do any differentiation if we don't have any numbers to differentiate. We'll want to add some extra functionality that isn't in the standard `float` type so we'll need to create our own. Let's call it a `Tensor`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "fdfa0e00-4a53-4e52-9d02-99f93e9e303e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Tensor(5)"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "class Tensor:\n",
+ "    \"\"\"\n",
+ "    Just a number (for now)\n",
+ "    \"\"\"\n",
+ "\n",
+ "    value: float\n",
+ "\n",
+ "    def __init__(self, value: float):\n",
+ "        self.value = value\n",
+ "\n",
+ "    def __repr__(self) -> str:\n",
+ "        \"\"\"\n",
+ "        Create a printable string representation of this\n",
+ "        object\n",
+ "\n",
+ "        This function gets called when you pass a Tensor to print\n",
+ "\n",
+ "        Without this function:\n",
+ "        >>> print(Tensor(5))\n",
+ "        <__main__.Tensor at 0x104fd1950>\n",
+ "\n",
+ "        With this function:\n",
+ "        >>> print(Tensor(5))\n",
+ "        Tensor(5)\n",
+ "        \"\"\"\n",
+ "        return f\"Tensor({self.value})\"\n",
+ "\n",
+ "\n",
+ "# try it out\n",
+ "Tensor(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7aa4ebb0-e1a0-4e1f-a5ed-5fa0b0ce62ba",
+ "metadata": {},
+ "source": [
+ "Next, we'll need some simple operations: addition, subtraction and multiplication."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "c01dc061-c61f-4b3d-b70a-06129efd9fae",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def _add(a: Tensor, b: Tensor):\n",
+ "    \"\"\"\n",
+ "    Add two tensors\n",
+ "    \"\"\"\n",
+ "    return Tensor(a.value + b.value)\n",
+ "\n",
+ "\n",
+ "def _sub(a: Tensor, b: Tensor):\n",
+ "    \"\"\"\n",
+ "    Subtract tensor b from tensor a\n",
+ "    \"\"\"\n",
+ "    return Tensor(a.value - b.value)\n",
+ "\n",
+ "\n",
+ "def _mul(a: Tensor, b: Tensor):\n",
+ "    \"\"\"\n",
+ "    Multiply two tensors\n",
+ "    \"\"\"\n",
+ "    return Tensor(a.value * b.value)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fade0f0f-988d-4f4f-a495-8f414f2381d9",
+ "metadata": {},
+ "source": [
+ "We can use our operations as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "4007641b-ec4a-4cdb-9160-427bf0ec2b6f",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "✅ - Want: 7, Got: 7\n",
+ "✅ - Want: -1, Got: -1\n",
+ "✅ - Want: 12, Got: 12\n"
+ ]
+ }
+ ],
+ "source": [
+ "def test(got: Any, want: Any):\n",
+ "    \"\"\"\n",
+ "    Check that two objects are equal to each other\n",
+ "    \"\"\"\n",
+ "    indicator = \"✅\" if want == got else \"❌\"\n",
+ "    print(f\"{indicator} - Want: {want}, Got: {got}\")\n",
+ "\n",
+ "\n",
+ "a = Tensor(3)\n",
+ "b = Tensor(4)\n",
+ "\n",
+ "\n",
+ "test(_add(a, b).value, 7)\n",
+ "test(_sub(a, b).value, -1)\n",
+ "test(_mul(a, b).value, 12)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "605f5529-fc6f-43a2-b299-eb25cbc0523f",
+ "metadata": {},
+ "source": [
+ "## Scalar derivatives\n",
+ "Diving straight into differentiating matrices sounds too hard so let's start with something simpler: differentiating scalars. The simplest scalar derivative I can think of is the derivative of a tensor with respect to (wrt) itself:\n",
+ "$$\\frac{dx}{dx} = 1$$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "db267512-5932-411f-8204-b87d688c79d1",
+ "metadata": {},
+ "source": [
+ "A more interesting case is the derivative of two tensors added together (note we are using partial derivatives because our function has multiple inputs):\n",
+ "$$f(x, y) = x + y$$\n",
+ "$$\\frac{\\partial f}{\\partial x} = 1$$\n",
+ "$$\\frac{\\partial f}{\\partial y} = 1$$\n",
+ "\n",
+ "We can do a similar thing for multiplication and subtraction\n",
+ "\n",
+ "|$f(x, y)$|$\\frac{\\partial f}{\\partial x}$|$\\frac{\\partial f}{\\partial y}$|\n",
+ "|-|-|-|\n",
+ "|$x + y$|$1$|$1$|\n",
+ "|$x - y$|$1$|$-1$|\n",
+ "|$x \\times y$|$y$|$x$|"
+ ]
+ },
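+ {
+ "cell_type": "markdown",
+ "id": "added-sketch-finite-differences",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check, the table above can be compared against a numerical estimate. The sketch below is not part of Tricycle: the helper `finite_difference` and the step size `eps` are assumptions introduced purely for illustration, and it works on plain floats rather than `Tensor` objects.\n",
+ "\n",
+ "```python\n",
+ "from typing import Callable\n",
+ "\n",
+ "\n",
+ "def finite_difference(\n",
+ "    f: Callable[[float, float], float],\n",
+ "    x: float,\n",
+ "    y: float,\n",
+ "    eps: float = 1e-6,\n",
+ ") -> tuple[float, float]:\n",
+ "    \"\"\"\n",
+ "    Estimate df/dx and df/dy with central differences\n",
+ "    (hypothetical helper, for illustration only)\n",
+ "    \"\"\"\n",
+ "    df_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)\n",
+ "    df_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)\n",
+ "    return df_dx, df_dy\n",
+ "\n",
+ "\n",
+ "x, y = 3.0, 4.0\n",
+ "\n",
+ "# Each estimate should be close to the matching row of the table\n",
+ "add_estimate = finite_difference(lambda x, y: x + y, x, y)  # about (1, 1)\n",
+ "sub_estimate = finite_difference(lambda x, y: x - y, x, y)  # about (1, -1)\n",
+ "mul_estimate = finite_difference(lambda x, y: x * y, x, y)  # about (y, x)\n",
+ "```\n",
+ "\n",
+ "Numerical estimates like this are only approximate, which is one reason to work out the exact derivatives by hand as we did in the table."
+ ]
+ },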
+ {
+ "cell_type": "markdown",
+ "id": "b3ed0850-456b-4127-b557-d7cf39524e47",
+ "metadata": {},
+ "source": [
+ "Now that we've worked out these derivatives mathematically, the next step is to convert them into code. In the table above, when we make a tensor by combining two tensors with an operation, the derivative only ever depends on the inputs and the operation. There is no \"hidden state\".\n",
+ "\n",
+ "This means that the only information we need to store is the inputs to an operation and a function to calculate the derivative wrt each input. With this, we should be able to differentiate any binary function wrt its inputs. A good place to store this information is in the tensor that is produced by the operation.\n",
+ "\n",
+ "We'll add some new attributes to our `Tensor`: `args` and `local_derivatives`. If the tensor is the output of an operation, then `args` will store the arguments to the operation and `local_derivatives` will store the derivatives wrt each input. We're calling it `local_derivatives` to avoid confusion when we start nesting functions.\n",
+ "\n",
+ "Once we've calculated the derivative (from our `args` and `local_derivatives`) we'll need to store it. It turns out that the neatest place to put this is in the tensor that the output is being differentiated wrt. We'll call this `derivative`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "36f5546b-33a1-4808-be0d-77a898938066",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class Tensor:\n",
+ "    \"\"\"\n",
+ "    A number that can be differentiated\n",
+ "    \"\"\"\n",
+ "\n",
+ "    # If the tensor was made by an operation, the operation arguments\n",
+ "    # are stored in args\n",
+ "    args: tuple[\"Tensor\", ...] = ()\n",
+ "    # If the tensor was made by an operation, the derivatives wrt\n",
+ "    # operation inputs are stored in local_derivatives\n",
+ "    local_derivatives: tuple[\"Tensor\", ...] = ()\n",
+ "    # The derivative we have calculated\n",
+ "    derivative: Optional[\"Tensor\"] = None\n",
+ "\n",
+ "    def __init__(self, value: float):\n",
+ "        self.value = value\n",
+ "\n",
+ "    def __repr__(self) -> str:\n",
+ "        \"\"\"\n",
+ "        Create a printable string representation of this\n",
+ "        object\n",
+ "\n",
+ "        This function gets called when you pass a Tensor to print\n",
+ "\n",
+ "        Without this function:\n",
+ "        >>> print(Tensor(5))\n",
+ "        <__main__.Tensor at 0x104fd1950>\n",
+ "\n",
+ "        With this function:\n",
+ "        >>> print(Tensor(5))\n",
+ "        Tensor(5)\n",
+ "        \"\"\"\n",
+ "        return f\"Tensor({self.value})\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "befb4571-7c6f-4e07-8731-d511dff0667e",
+ "metadata": {},
+ "source": [
+ "For example, if we have "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "ede7d7e3-b1c1-4f41-9657-316cbe0eee51",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "a = Tensor(3)\n",
+ "b = Tensor(4)\n",
+ "\n",
+ "output = _mul(a, b)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1df49bf3-7c4f-414b-8582-d65c68ca6563",
+ "metadata": {},
+ "source": [
+ "Then `output.args` and `output.local_derivatives` should be set to:\n",
+ "\n",
+ "```python\n",
+ "output.args == (Tensor(3), Tensor(4))\n",
+ "output.local_derivatives == (\n",
+ " b, # derivative of output wrt a is b\n",
+ " a, # derivative of output wrt b is a\n",
+ ")\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fd49ca9b-e27e-4400-9542-fa1912fd63fa",
+ "metadata": {},
+ "source": [
+ "Once we have actually computed the derivatives, then the derivative of `output` wrt `a` will be stored in `a.derivative` and should be equal to `b` (which is 4 in this case). \n",
+ "\n",
+ "We know that we've done everything right once these tests pass:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "8d86f1c3-9a65-4468-9089-20a6bf0691c7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "❌ - Want: (Tensor(3), Tensor(4)), Got: ()\n",
+ "❌ - Want: (Tensor(4), Tensor(3)), Got: ()\n",
+ "❌ - Want: Tensor(4), Got: None\n",
+ "❌ - Want: Tensor(3), Got: None\n"
+ ]
+ }
+ ],
+ "source": [
+ "a = Tensor(3)\n",
+ "b = Tensor(4)\n",
+ "\n",
+ "output = _mul(a, b)\n",
+ "\n",
+ "# TODO: differentiate here\n",
+ "\n",
+ "test(got=output.args, want=(a, b))\n",
+ "test(got=output.local_derivatives, want=(b, a))\n",
+ "test(a.derivative, b)\n",
+ "test(b.derivative, a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2ed01e92-17b5-4aba-8d32-0be2983466e1",
+ "metadata": {},
+ "source": [
+ "First, let's add a function to our `Tensor` that will actually calculate the derivatives for each of the function arguments. PyTorch calls this function `backward` so we'll do the same."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "3f0c4e1e-66e2-43e1-ae62-f029347369c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class Tensor:\n",
+ "    \"\"\"\n",
+ "    A number that can be differentiated\n",
+ "    \"\"\"\n",
+ "\n",
+ "    # If the tensor was made by an operation, the operation arguments\n",
+ "    # are stored in args\n",
+ "    args: tuple[\"Tensor\", ...] = ()\n",
+ "    # If the tensor was made by an operation, the derivatives wrt\n",
+ "    # operation inputs are stored in local_derivatives\n",
+ "    local_derivatives: tuple[\"Tensor\", ...] = ()\n",
+ "    # The derivative we have calculated\n",
+ "    derivative: Optional[\"Tensor\"] = None\n",
+ "\n",
+ "    def __init__(self, value: float):\n",
+ "        self.value = value\n",
+ "\n",
+ "    def backward(self):\n",
+ "        # args and local_derivatives default to empty tuples (never None),\n",
+ "        # so check for emptiness to spot tensors not made by an operation\n",
+ "        if not self.args or not self.local_derivatives:\n",
+ "            raise ValueError(\n",
+ "                \"Cannot differentiate a Tensor that is not a function of other Tensors\"\n",
+ "            )\n",
+ "\n",
+ "        for arg, derivative in zip(self.args, self.local_derivatives):\n",
+ "            arg.derivative = derivative\n",
+ "\n",
+ "    def __repr__(self) -> str:\n",
+ "        \"\"\"\n",
+ "        Create a printable string representation of this\n",
+ "        object\n",
+ "\n",
+ "        This function gets called when you pass a Tensor to print\n",
+ "\n",
+ "        Without this function:\n",
+ "        >>> print(Tensor(5))\n",
+ "        <__main__.Tensor at 0x104fd1950>\n",
+ "\n",
+ "        With this function:\n",
+ "        >>> print(Tensor(5))\n",
+ "        Tensor(5)\n",
+ "        \"\"\"\n",
+ "        return f\"Tensor({self.value})\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "92d07a1a-eae0-40b4-84d6-4e57b00f16f1",
+ "metadata": {},
+ "source": [
+ "This only works if we also store the arguments and local derivatives in the output tensors of operations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "99f6cd69-55a0-485d-a169-bf7f32e16e21",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def _add(a: Tensor, b: Tensor):\n",
+ "    \"\"\"\n",
+ "    Add two tensors\n",
+ "    \"\"\"\n",
+ "    result = Tensor(a.value + b.value)\n",
+ "    result.local_derivatives = (Tensor(1), Tensor(1))\n",
+ "    result.args = (a, b)\n",
+ "    return result\n",
+ "\n",
+ "\n",
+ "def _sub(a: Tensor, b: Tensor):\n",
+ "    \"\"\"\n",
+ "    Subtract tensor b from a\n",
+ "    \"\"\"\n",
+ "    result = Tensor(a.value - b.value)\n",
+ "    result.local_derivatives = (Tensor(1), Tensor(-1))\n",
+ "    result.args = (a, b)\n",
+ "    return result\n",
+ "\n",
+ "\n",
+ "def _mul(a: Tensor, b: Tensor):\n",
+ "    \"\"\"\n",
+ "    Multiply two tensors\n",
+ "    \"\"\"\n",
+ "    result = Tensor(a.value * b.value)\n",
+ "    result.local_derivatives = (b, a)\n",
+ "    result.args = (a, b)\n",
+ "    return result"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc72894d-3d9a-4dfa-aca3-b5b97e57c729",
+ "metadata": {},
+ "source": [
+ "Let's re-run our tests and see if it works."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "77657e72-49ea-49a7-a8df-7edc6ad0278a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "✅ - Want: (Tensor(3), Tensor(4)), Got: (Tensor(3), Tensor(4))\n",
+ "✅ - Want: (Tensor(4), Tensor(3)), Got: (Tensor(4), Tensor(3))\n",
+ "✅ - Want: Tensor(4), Got: Tensor(4)\n",
+ "✅ - Want: Tensor(3), Got: Tensor(3)\n"
+ ]
+ }
+ ],
+ "source": [
+ "a = Tensor(3)\n",
+ "b = Tensor(4)\n",
+ "\n",
+ "output = _mul(a, b)\n",
+ "\n",
+ "output.backward()\n",
+ "\n",
+ "test(got=output.args, want=(a, b))\n",
+ "test(got=output.local_derivatives, want=(b, a))\n",
+ "test(a.derivative, b)\n",
+ "test(b.derivative, a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fddf7959-2f69-497b-9a6c-0c31cb375e5c",
+ "metadata": {},
+ "source": [
+ "So far so good. Let's try nesting operations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "0ea3288f-24db-4fa8-8a11-d9ea6dec2ffa",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "❌ - Want: Tensor(3), Got: None\n"
+ ]
+ }
+ ],
+ "source": [
+ "a = Tensor(3)\n",
+ "b = Tensor(4)\n",
+ "\n",
+ "output_1 = _mul(a, b)\n",
+ "# z = a + (a * b)\n",
+ "output_2 = _add(a, output_1)\n",
+ "\n",
+ "output_2.backward()\n",
+ "\n",
+ "# should get\n",
+ "# dz/db = 0 + a = a\n",
+ "test(b.derivative, a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c09a6438-a069-4616-99b4-e58b7b5ac183",
+ "metadata": {},
+ "source": [
+ "Something has gone wrong. \n",
+ "\n",
+ "We should have got `a` as the derivative for `b` but we got `None` instead. Looking through the `.backward()` function, the issue is pretty clear:\n",
+ "we haven't thought about nested functions. To get this example working, we'll need to figure out how to calculate derivatives through multiple functions instead of just one."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04b75711-af3e-446e-b2be-05fd55c00927",
+ "metadata": {},
+ "source": [
+ "## Chaining Functions Together"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7377a549-58dd-4779-847e-461ff1300e90",
+ "metadata": {},
+ "source": [
+ "To calculate derivatives of nested functions, we can use a rule from calculus: The Chain Rule.\n",
+ "\n",
+ "For a variable $z$ generated by nested functions $f$ and $g$ such that\n",
+ "$$z = f(g(x))$$\n",
+ "\n",
+ "Then the derivative of $z$ wrt $x$ is:\n",
+ "$$\\frac{\\partial z}{\\partial x} = \\frac{\\partial f(u)}{\\partial u} \\frac{\\partial g(x)}{\\partial x}$$\n",
+ "\n",
+ "Here, $u$ is a dummy variable. $\\frac{\\partial f(u)}{\\partial u}$ means the derivative of $f$ wrt its input.\n",
+ "\n",
+ "For example, if \n",
+ "\n",
+ "$$f(x) = g(x)^2$$\n",
+ "Then we can define $u=g(x)$ and rewrite $f$ in terms of u \n",
+ "$$f(u) = u^2 \\implies \\frac{\\partial f(u)}{\\partial u} = 2u = 2 g(x)$$\n",
+ "\n",
+ "### Multiple Variables\n",
+ "The chain rule works as you might expect for functions of multiple variables. When differentiating wrt a variable, we can treat the other variables as constant and differentiate as normal. Writing $u_1$ and $u_2$ for the first and second inputs of $f$:\n",
+ "$$z = f(g(x), h(y))$$\n",
+ "\n",
+ "$$\\frac{\\partial z}{\\partial x} = \\frac{\\partial f(u_1, u_2)}{\\partial u_1} \\frac{\\partial g(x)}{\\partial x}$$\n",
+ "$$\\frac{\\partial z}{\\partial y} = \\frac{\\partial f(u_1, u_2)}{\\partial u_2} \\frac{\\partial h(y)}{\\partial y}$$\n",
+ "\n",
+ "If we have different functions that take the same input, we differentiate each of them individually and then add them together. For\n",
+ "\n",
+ "$$z = f(g(x), h(x))$$\n",
+ "\n",
+ "we get\n",
+ "$$\\frac{\\partial z}{\\partial x} = \\frac{\\partial f(u_1, u_2)}{\\partial u_1}\\frac{\\partial g(x)}{\\partial x} + \\frac{\\partial f(u_1, u_2)}{\\partial u_2}\\frac{\\partial h(x)}{\\partial x}$$\n",
+ "\n",
+ "### More than 2 functions\n",
+ "If we chain 3 functions together, so that $z = f(g(h(x)))$, we still just multiply the derivatives for each function together:\n",
+ "\n",
+ "$$\\frac{\\partial z}{\\partial x} = \\frac{\\partial f(u)}{\\partial u} \\frac{\\partial g(h(x))}{\\partial x} = \\frac{\\partial f(u)}{\\partial u} \\frac{\\partial g(u)}{\\partial u}\\frac{\\partial h(x)}{\\partial x}$$\n",
+ "\n",
+ "And this generalises to any amount of nesting:\n",
+ "\n",
+ "$$z = f_1(f_2(\\ldots f_{n-1}(f_n(x))\\ldots)) \\implies \\frac{\\partial z}{\\partial x} = \\frac{\\partial f_1(u)}{\\partial u}\\frac{\\partial f_2(u)}{\\partial u}\\ldots\\frac{\\partial f_{n-1}(u)}{\\partial u}\\frac{\\partial f_{n}(x)}{\\partial x}$$"
+ ]
+ },
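+ {
+ "cell_type": "markdown",
+ "id": "added-sketch-chain-rule-check",
+ "metadata": {},
+ "source": [
+ "To make the chain rule a little more concrete, here is a small numerical check. It is only a sketch: the functions `g(x) = 3x + 1` and `f(u) = u^2`, the input `x = 2` and the step size `eps` are all arbitrary choices made up for this example.\n",
+ "\n",
+ "```python\n",
+ "def g(x: float) -> float:\n",
+ "    return 3 * x + 1\n",
+ "\n",
+ "\n",
+ "def f(u: float) -> float:\n",
+ "    return u**2\n",
+ "\n",
+ "\n",
+ "def dg_dx(x: float) -> float:\n",
+ "    # derivative of g wrt its input\n",
+ "    return 3.0\n",
+ "\n",
+ "\n",
+ "def df_du(u: float) -> float:\n",
+ "    # derivative of f wrt its input\n",
+ "    return 2 * u\n",
+ "\n",
+ "\n",
+ "x = 2.0\n",
+ "\n",
+ "# Chain rule: differentiate f wrt its input (evaluated at u = g(x)),\n",
+ "# then multiply by the derivative of g wrt x\n",
+ "chain_rule = df_du(g(x)) * dg_dx(x)\n",
+ "\n",
+ "# Numerical estimate of dz/dx for z = f(g(x))\n",
+ "eps = 1e-6\n",
+ "numerical = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)\n",
+ "```\n",
+ "\n",
+ "The two numbers should agree closely, which is the chain rule doing its job: we never had to expand $f(g(x))$ to differentiate it."
+ ]
+ },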
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "65f9b9d4-0d9a-47bb-94d3-64c2dfe648eb",
+ "metadata": {},
+ "source": [
+ "### A picture is worth a thousand equations\n",
+ "As you probably noticed, the maths is starting to get quite dense. When we start working with neural networks, we can easily get 100s or 1000s of functions deep so to get a handle on things, we'll need a different strategy. Helpfully, there is one: turning it into a graph.\n",
+ "\n",
+ "We can start with some rules:\n",
+ "\n",
+ "> Variables are represented with circles and operations are represented with boxes\n",
+ "\n",
+ "*[Figure: a variable drawn as a circle and an operation drawn as a box]*\n",
+ "\n",
+ "\n",
+ "> Inputs to an operation are represented with arrows that point to the operation box. Outputs point away.\n",
+ "\n",
+ "For example, here is the diagram for $z = mx$\n",
+ "\n",
+ "*[Figure: graph for $z = mx$]*\n",
+ "\n",
+ "And that's it! All of the equations we'll be working with can be represented graphically using these simple rules. To try it out, let's draw the diagram for a more complex formula:\n",
+ "\n",
+ "*[Figure: graph for a more complex formula]*\n",
+ "\n",
+ "This is an example of a structure called a graph (also called a network). A lot of problems in computer science get much easier if you can represent them with a graph and this is no exception."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "03129971-5cdb-454e-829b-a3d5e017207b",
+ "metadata": {},
+ "source": [
+ "The real power of these diagrams is that they can also help us with our derivatives. Take \n",
+ "$$y = mx + p = \\texttt{add}(p, \\texttt{mul}(x, m)).$$\n",
+ "\n",
+ "From before, we can find its derivatives by differentiating each operation wrt its inputs and multiplying the results together. In this case, we get:\n",
+ "$$\\frac{\\partial y}{\\partial p} = \\frac{\\partial \\texttt{add}(u_1, u_2)}{\\partial u_1} = 1$$\n",
+ "$$\\frac{\\partial y}{\\partial m} = \\frac{\\partial \\texttt{add}(u_1, u_2)}{\\partial u_2}\\frac{\\partial \\texttt{mul}(u_1, u_2)}{\\partial u_2} = 1 \\times x = x$$\n",
+ "$$\\frac{\\partial y}{\\partial x} = \\frac{\\partial \\texttt{add}(u_1, u_2)}{\\partial u_2}\\frac{\\partial \\texttt{mul}(u_1, u_2)}{\\partial u_1} = 1 \\times m = m$$\n",
+ "\n",
+ "We can also graph it like this:\n",
+ "\n",
+ "*[Figure: graph for $y = mx + p$ with each edge labelled with a letter]*\n",
+ "\n",
+ "If you imagine walking from $y$ to each of the inputs, you might notice a similarity between the edges you pass through and the equations above. If you walk from $y$ to $x$, you'll pass through `a->c->d`. Similarly, if you walk from $y$ to $m$, you'll pass through `a->c->e`. Notice that both paths go through `c`, the edge coming out of `add` that corresponds to the input $u_2$. Also, both equations include the term $\\frac{\\partial \\texttt{add}(u_1, u_2)}{\\partial u_2}$. \n",
+ "\n",
+ "If I rename the edges as follows:\n",
+ "\n",
+ "*[Figure: graph for $y = mx + p$ with each edge labelled with its derivative]*\n",
+ "\n",
+ "We can see that going from $y$ to $x$, we pass through $1$, $\\frac{\\partial \\texttt{add}(u_1, u_2)}{\\partial u_2}$ and $\\frac{\\partial \\texttt{mul}(u_1, u_2)}{\\partial u_1}$. If we multiply these together, we get exactly $\\frac{\\partial \\texttt{add}(u_1, u_2)}{\\partial u_2}\\frac{\\partial \\texttt{mul}(u_1, u_2)}{\\partial u_1} = \\frac{\\partial y}{\\partial x}$!\n",
+ "\n",
+ "It turns out that this rule works in general:\n",
+ "\n",
+ "> If we have some operation $\\texttt{op}(u_1, u_2, ..., u_n)$, we should label the edge corresponding to input $u_i$ with $\\frac{\\partial \\texttt{op}(u_1, u_2, ..., u_n)}{\\partial u_i}$\n",
+ "\n",
+ "Then, if we want to find the derivative of the output node wrt any of the inputs,\n",
+ "\n",
+ "> The derivative of an output variable wrt one of the input variables can be found by traversing the graph from the output to the input and multiplying together the derivatives for every edge on the path\n",
+ "\n",
+ "To cover every edge case, there are some extra details:\n",
+ "\n",
+ "> If a graph contains multiple paths from the output to an input, then the derivative is the sum of the products for each path\n",
+ "\n",
+ "This comes from the case we saw earlier: when different functions share the same input, we have to add their derivative chains together.\n",
+ "\n",
+ "> If an edge is not the input to any function, its derivative is 1\n",
+ "\n",
+ "This covers the edge that leads from the final operation to the output. You can think of the edge having the derivative $\\frac{\\partial y}{\\partial y}=1$\n",
+ "\n",
+ "And that's it! Let's try it out with $z = (x + c)x$:\n",
+ "\n",
+ "*[Figure: graph for $z = (x + c)x$ with a red path and a green path from $z$ to $x$]*\n",
+ "\n",
+ "Here, instead of writing the formulae for each derivative, I have gone ahead and calculated their actual values. Instead of just figuring out the formula for a derivative, we want to calculate its value when we plug in our input parameters. \n",
+ "\n",
+ "All that remains is to multiply the local derivatives together along each path. We'll call the product of derivatives along a single path a chain (after the chain rule)\n",
+ "\n",
+ "We can get from $z$ to $x$ via the green path and the red path. Following these paths, we get:\n",
+ "$$\\text{red path} = 1 \\times (x + c) = x + c$$\n",
+ "Along the green path we get:\n",
+ "$$\\text{green path} = 1 \\times x \\times 1 = x$$\n",
+ "\n",
+ "Adding these together, we get $(x+c) + x = 2x + c$\n",
+ "\n",
+ "If we work out the derivative algebraically:\n",
+ "\n",
+ "$$\\frac{\\partial z}{\\partial x} = \\frac{\\partial}{\\partial x}((x+c)x) = \\frac{\\partial}{\\partial x}(x^2 + cx) = \\frac{\\partial x^2}{\\partial x} + c\\frac{\\partial x}{\\partial x} = 2x + c$$\n",
+ "\n",
+ "We can see that it seems to work! (Calculating $\\frac{\\partial z}{\\partial c}$ is left as an exercise for the reader) \n",
+ "\n",
+ "To summarise, we have invented the following algorithm for calculating the derivative of a variable wrt its inputs:\n",
+ "\n",
+ "1. Turn the equation into a graph\n",
+ "2. Label each edge with the appropriate derivative\n",
+ "3. Find every path from the output to the input variable you care about\n",
+ "4. Follow each path and multiply the derivatives you pass through\n",
+ "5. Add together the results for each path\n",
+ "\n",
+ "We have an algorithm in pictures and words, let's turn it into code."
+ ]
+ },
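+ {
+ "cell_type": "markdown",
+ "id": "added-sketch-path-sum",
+ "metadata": {},
+ "source": [
+ "Before wiring this into `Tensor`, the five steps can be sketched on a hand-written graph for $z = (x + c)x$. Everything here is an assumption made for illustration: the dictionary layout, the `derivative` helper, the node name `sum` and the values $x = 3$, $c = 4$ are not part of Tricycle, and the sketch uses plain recursion to walk the paths rather than an explicit search.\n",
+ "\n",
+ "```python\n",
+ "x_value, c_value = 3.0, 4.0\n",
+ "sum_value = x_value + c_value\n",
+ "\n",
+ "# Steps 1 and 2: the graph as {node: [(input node, local derivative), ...]}\n",
+ "graph = {\n",
+ "    \"z\": [(\"sum\", x_value), (\"x\", sum_value)],  # z = mul(sum, x)\n",
+ "    \"sum\": [(\"x\", 1.0), (\"c\", 1.0)],  # sum = add(x, c)\n",
+ "}\n",
+ "\n",
+ "\n",
+ "def derivative(output: str, target: str) -> float:\n",
+ "    \"\"\"\n",
+ "    Sum, over every path from output to target, of the product\n",
+ "    of the local derivatives along that path\n",
+ "    \"\"\"\n",
+ "    if output == target:\n",
+ "        # The empty path contributes a factor of 1\n",
+ "        return 1.0\n",
+ "    total = 0.0\n",
+ "    # Steps 3 to 5: follow every edge, multiply along each path\n",
+ "    # and add the results together\n",
+ "    for input_node, local_derivative in graph.get(output, []):\n",
+ "        total += local_derivative * derivative(input_node, target)\n",
+ "    return total\n",
+ "\n",
+ "\n",
+ "dz_dx = derivative(\"z\", \"x\")\n",
+ "```\n",
+ "\n",
+ "With these values, `dz_dx` should match $2x + c$ from the algebra above. Recursion keeps the sketch short, but it revisits shared nodes once per path, so it would not scale to large graphs."
+ ]
+ },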
+ {
+ "cell_type": "markdown",
+ "id": "47989351-92e5-491c-a58a-b109c1d84f92",
+ "metadata": {},
+ "source": [
+ "### The Algorithm™\n",
+ "\n",
+ "Surprisingly, we have actually already converted our functions into graphs. If you recall, when we generate a tensor from an operation, we record the inputs to the operation in the output tensor (in `.args`). We also stored the derivatives wrt each of the inputs in `.local_derivatives`, which means that we know both the destination and derivative for every edge that points to a given node. This means that we've already completed steps 1 and 2.\n",
+ "\n",
+ "The next challenge is to find all paths from the tensor we want to differentiate to the input tensors that created it. Because none of our operations are self referential (outputs are never fed back in as inputs), and all of our edges have a direction, our graph of operations is a directed acyclic graph or DAG. The property of the graph having no cycles means that we can find all paths to every parameter pretty easily with a Breadth First Search (or Depth First Search but BFS makes some optimisations easier as we'll see in part 2).\n",
+ "\n",
+ "To try it out, let's recreate that giant graph we made earlier. We can do this by first calculating $L$ from the inputs:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "05bfab9c-fea6-40cb-bcd2-0c60a7f1950c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y = Tensor(1)\n",
+ "m = Tensor(2)\n",
+ "x = Tensor(3)\n",
+ "c = Tensor(4)\n",
+ "\n",
+ "# L = (y - (mx + c))^2\n",
+ "left = _sub(y, _add(_mul(m, x), c))\n",
+ "right = _sub(y, _add(_mul(m, x), c))\n",
+ "\n",
+ "L = _mul(left, right)\n",
+ "\n",
+ "# Attaching names to tensors will make our\n",
+ "# diagram look nicer\n",
+ "y.name = \"y\"\n",
+ "m.name = \"m\"\n",
+ "x.name = \"x\"\n",
+ "c.name = \"c\"\n",
+ "L.name = \"L\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "795a7000-73ef-45dc-b2e0-b76da641ae4d",
+ "metadata": {},
+ "source": [
+ "And then using Breadth First Search to do 3 things:\n",
+ " - Find all nodes\n",
+ " - Find all edges\n",
+ " - Find all paths from $L$ to our parameters\n",
+ "\n",
+ "We haven't implemented a simple way to check whether two tensors are identical so we'll need to compare hashes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "6e20bfa5-4ebe-4ff3-a98c-94f5e099c589",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from collections import deque\n",
+ "\n",
+ "# A first-in-first-out queue of (node, path taken to reach it)\n",
+ "# gives us a Breadth First Search\n",
+ "queue = deque([(L, [L])])\n",
+ "\n",
+ "nodes = []\n",
+ "edges = []\n",
+ "while queue:\n",
+ "    node, current_path = queue.popleft()\n",
+ "    # Record nodes we haven't seen before\n",
+ "    if hash(node) not in [hash(n) for n in nodes]:\n",
+ "        nodes.append(node)\n",
+ "\n",
+ "    # If we have reached a parameter (it has no arguments\n",
+ "    # because it wasn't created by an operation) then\n",
+ "    # record the path taken to get here\n",
+ "    if not node.args:\n",
+ "        if not hasattr(node, \"paths\"):\n",
+ "            node.paths = []\n",
+ "        node.paths.append(current_path)\n",
+ "        continue\n",
+ "\n",
+ "    for arg in node.args:\n",
+ "        queue.append((arg, current_path + [arg]))\n",
+ "        # Record every new edge\n",
+ "        edges.append((hash(node), hash(arg)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "13c5a96e-9485-4fa7-b19a-54f4bf76f4c3",
+ "metadata": {},
+ "source": [
+ "Now that we've got all of the edges and nodes, we have complete knowledge of our computational graph. Let's use networkx to plot it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "5f73c63a-2e9f-425f-b4a5-52604bcbb3e5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "