# Introduction to Reinforcement Learning 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eleni-vasilaki/rl-notes/raw/main/notebooks/01_introduction.ipynb)

## What Reinforcement Learning Tells Me About Happiness

I'd go so far as to say that there is no other machine learning technique as relevant to life as reinforcement learning. It is not only its origin—reinforcement learning is rooted in psychological experiments—but also the fact that reinforcement learning ideas can be found in philosophical documents dating back to at least the time of Plato. Even today, reinforcement learning can tell me how to achieve happiness.

If you think I am biased, I will agree with you. This is my interpretation of the philosophical texts I have read, the technical books I teach, my studies on reinforcement learning, and—of course—my own experiences. Thus, I am going to start from Epicurus, who is one of my favourite philosophers because he is misunderstood as a hedonist, while he was actually practising and teaching a theory akin to reinforcement learning.

Epicurus believed in optimising a reward function across one’s life. He explicitly said that pursuing the pleasures of the flesh is not the point. For the exact phrasing, I will direct you to my other text *["Was Epicurus the Father of Reinforcement Learning?"](https://arxiv.org/abs/1710.04582)*, which is based on my talk for our Machine Learning retreat in 2017. Back then, I thought that since I became the head of the Machine Learning group, I no longer needed to prove myself by giving technical talks. Besides, at 9:00 am, people would rather hear something different for a change. It was meant to be an amusing talk.

Epicurus suggests we should choose the actions that benefit our souls. This is a very interesting choice of words for someone who limited the presence of gods in his philosophy and who didn’t believe in the afterlife. I will therefore interpret avoiding actions that bring turmoil to our souls as choosing what is good in the long run. Epicurus was suggesting that we also account for any future punishment that today’s pleasure will bring. If you have ever got drunk, you certainly know what I am talking about.

Please forgive me if I turn everything into an equation. I assure you that on the surface it is a most trivial one, though in its essence it is the meaning of it all. We can write the total return following an action at time $t$ as:

$$
G(t) = R(t+1) + R(t+2) + R(t+3) + \dots + R(t+N)
$$

where $R(t)$ represents the reward (or, if negative, the punishment) at time point $t$. Rewards can, of course, be 0, meaning that neither is received.

In writing this equation, I assume that I will eventually die, that my moments of pleasure and punishment are finite, and that their importance doesn’t diminish with how far in the future they are. Finiteness matters here, as I cannot maximise a sum of infinite value. Otherwise, if I feel invincible, which I occasionally still do, I will have to write:

$$
G(t) = R(t+1) + \gamma R(t+2) + \gamma^2 R(t+3) + \gamma^3 R(t+4) + \dots
$$

Now I can live forever, but each reward is multiplied by a discount factor $\gamma$, a positive real number with $\gamma < 1$, raised to a power that grows with how far in the future the reward is. This just tells me that the reward I receive now will always be better than the same amount of reward that I will get next year, and it explains why I am so impatient.
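
To make the two sums concrete, here is a minimal sketch in Python. The function name `discounted_return` and the reward sequence are my own illustrative assumptions, not part of any library: `rewards[0]` plays the role of $R(t+1)$, and setting `gamma=1.0` recovers the finite, undiscounted sum.

```python
import numpy as np

def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**k * rewards[k]; rewards[0] stands for R(t+1)."""
    rewards = np.asarray(rewards, dtype=float)
    # Discount weights 1, gamma, gamma**2, ... for successive time steps
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Hypothetical sequence: an investment (-1) now, payoffs (+2) later
rewards = [-1.0, 0.0, 2.0, 2.0]

undiscounted = discounted_return(rewards, gamma=1.0)  # patient: plain sum
discounted = discounted_return(rewards, gamma=0.5)    # impatient: late rewards shrink

print("Undiscounted return:", undiscounted)
print("Discounted return (gamma = 0.5):", discounted)
```

With these made-up numbers the patient version of me sees a positive return, while the impatient one sees a negative return: the same investment no longer looks worthwhile once the later payoffs are discounted.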

In the Epicurean philosophy, there are clear instructions or suggestions regarding the values of various actions. For instance, Epicurus’ advice is to pursue friendship rather than romance because the latter brings jealousy and pain. I will therefore write in an equation that the value of friendship is higher than the value of romance:

$$
Q(\text{friendship}) > Q(\text{romance})
$$

where $Q$ is the value of the action under consideration, interpreted as the expected sum of all the future rewards (and punishments), discounted, of course, that can result from this action. Here, there is an omission: treating the value of an action as independent of our state is clearly an oversimplification. For instance, we can consider a state $S_t$ that includes our own "state of mind" and the other person involved at time $t$. The value of friendship itself must also depend on the person to whom we choose to offer our friendship. A more complete statement is therefore:

$$
Q(S_t, \text{friendship}) > Q(S_t, \text{romance})
$$

You can argue that the correctness of this inequality may very well depend on the specific state $S_t$, but if I sample states at random, it is likely to hold on average. Of course, nobody says that $Q(S_t, \text{friendship})$ cannot have a very low value itself if I invest in friendship with the wrong person, but it is likely to still be higher than $Q(S_t, \text{romance})$ if I invest in romance with the same wrong person.

Implicit here is the investment in all these actions. Any action of friendship, or of anything else in fact, rarely comes for free: it typically involves some effort. In this framework, and to keep things simple, I may consider the investment as a negative reward, i.e., something that I pay now in order to get a higher return in the future. The update rule for learning the $Q$ values according to the well-known [SARSA algorithm](http://incompleteideas.net/book/the-book.html) is:

$$
\Delta Q(S_t, A_t) = \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
$$

where $\alpha$ is the learning rate. Up to this factor, I can interpret the update as "total reward minus expected reward." The value $Q(S_t, A_t)$ is the expectation for immediate and future rewards that I will receive when I am in state $S_t$ and choose action $A_t$. If my prediction is correct, $Q(S_t, A_t)$ should be equal to the immediate reward $R_{t+1}$ I will receive as a consequence of my action plus the $Q$ value of the future state-action pair $(S_{t+1}, A_{t+1})$, i.e., my expected reward for the future state $S_{t+1}$ when taking future action $A_{t+1}$, discounted. If my expectation is wrong, then the terms do not match, and I need to update $Q(S_t, A_t)$.
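
A single update of this kind can be sketched as follows. This is a minimal illustration, assuming the standard SARSA form in which the learning rate $\alpha$ multiplies the whole difference between "what I got" and "what I expected". The table layout (states as rows, actions as columns), the function name `sarsa_update`, and all the numbers are hypothetical choices of mine.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update to Q in place and return the prediction error."""
    # Prediction error: (reward + discounted next value) minus current estimate
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical toy table: 2 states, 2 actions (0 = friendship, 1 = romance)
Q = np.zeros((2, 2))
Q[0, 0] = 1.0  # great expectations for action 0 in state 0

# The reward is positive (0.5) but smaller than expected
delta = sarsa_update(Q, s=0, a=0, r=0.5, s_next=1, a_next=0, alpha=0.5, gamma=0.9)

print("Prediction error:", delta)
print("Updated Q[0, 0]:", Q[0, 0])
```

Even though the reward here is positive, the prediction error comes out negative because the expectation was higher, so the estimate $Q(S_t, A_t)$ is revised downwards.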

Given my investments on the way to my goal, represented as negative rewards, I need, and perhaps expect, future rewards that are large enough to compensate for my investment and that arrive before they feel heavily discounted.

Therefore, receiving a smaller reward than anticipated can feel like punishment: the difference is negative. The film that your friend told you is amazing might disappoint you if you watch it with great expectations. Correspondingly, I remember watching the end of *["Lost"](https://en.wikipedia.org/wiki/Lost_(TV_series))* many years after its first airing, having often heard how awful it was. However, it didn’t seem quite as bad to me. You see, after all the negative comments I had heard, my expectations were pretty low.

In friendship or romance, or indeed anything else, great expectations and high investment are likely to lead to disappointment. Reinforcement learning suggests you should have low expectations in situations you cannot really control and should avoid over-investment. In doing so, any reward is more likely to feel rewarding and any punishment as less punishing.

Ongoing investments also lead to a natural bias in the perception of people. As an external observer of someone else, I may be aware of signs of success (rewards), but I am likely unaware of their investments (punishments). By contrast, I am perfectly aware of my own investments, and therefore my personal success may feel smaller to me than it appears in the eyes of other people. Sometimes it may even feel like a punishment: since I consumed the punishment first, the success may not be enough to make up for it. After all, I am impatient!

This is how reinforcement learning tells me to live my life: enjoy simple things; do not expect too much from others; do not over-invest; and never underestimate the effort or investment that people made in reaching their goals. Is there an element of luck? Reinforcement learning says there is, though unless you are trapped in local maxima, you will eventually find the optimal solution given sufficient time. This, however, is a discussion for another time.

*Eleni*

**Acknowledgements:** Thanks to Peter Dayan for his amazingly fast feedback on this text (as always!) and for pointing me to the work of [Kent Berridge](https://www.ncbi.nlm.nih.gov/pubmed/28943891), who makes the distinction between “liking” vs “wanting,” as well as the work of [Robb Rutledge](https://www.ncbi.nlm.nih.gov/pubmed/25092308), a modern-day Epicurus who proposed a computational model of momentary subjective happiness. Also to ChatGPT for meticulous proofreading. 


## A brief introduction to NumPy

**NumPy** is a fundamental library for numerical computing in Python. It provides an efficient, flexible multi-dimensional array object (`ndarray`) and a wide range of mathematical functions. In machine learning, data is most naturally represented as matrices because:

- **Data Organisation:** Datasets are typically organised in matrices, where each row is a data sample and each column a feature.
- **Efficient Computation:** Many machine learning algorithms (e.g. linear regression, neural networks, PCA) rely on linear algebra operations (e.g. matrix multiplication, transposition) that are highly optimised in NumPy.
- **Vectorised Operations:** NumPy’s support for vectorised computations means that operations on large datasets can be performed quickly without explicit loops.
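
As a small illustration of the points above, the sketch below computes the mean of each feature twice, once with explicit Python loops and once with a single vectorised call. The dataset `X` is a made-up example in which rows are samples and columns are features.

```python
import numpy as np

# Hypothetical dataset: 4 samples (rows) and 3 features (columns)
X = np.arange(12, dtype=float).reshape(4, 3)

# Explicit loops: accumulate each column's sum, then divide by the row count
means_loop = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    total = 0.0
    for i in range(X.shape[0]):
        total += X[i, j]
    means_loop[j] = total / X.shape[0]

# Vectorised equivalent: one call, no Python-level loops
means_vec = X.mean(axis=0)

print("Loop result:      ", means_loop)
print("Vectorised result:", means_vec)
```

Both approaches give the same numbers; the vectorised form is shorter and, on large arrays, typically much faster because the looping happens inside NumPy's compiled code.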

### Fundamental Matrix Operations

Below are key matrix operations with definitions, motivations, and examples.

#### 1. Matrix Multiplication

**Definition:** For matrices $A$ (of shape $m \times n$) and $B$ (of shape $n \times p$), the product $C = A \times B$ is an $m \times p$ matrix where:

$$
C_{ij} = \sum_{k=1}^{n} A_{ik} \times B_{kj}
$$

**Motivation:** Matrix multiplication is used in transforming data, computing neural network activations, and many other machine learning applications.

**Example:**

```python
import numpy as np

# Define two 2x2 matrices
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication using the @ operator
product = A @ B
print("Matrix product (A @ B):\n", product)
```

```
Matrix product (A @ B):
 [[19 22]
 [43 50]]
```
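
To connect the summation formula in the definition with the `@` operator, here is a deliberately slow sketch that spells out the three nested sums. The helper name `matmul_from_definition` is my own; in practice you would always use `@`.

```python
import numpy as np

def matmul_from_definition(A, B):
    """Compute C[i, j] = sum over k of A[i, k] * B[k, j] with explicit loops."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("Inner dimensions must match")
    C = np.zeros((m, p), dtype=np.result_type(A, B))
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

C = matmul_from_definition(A, B)
print("From the definition:\n", C)
print("Matches A @ B:", np.array_equal(C, A @ B))
```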


#### 2. Transposition

**Definition:** The transpose of a matrix is obtained by flipping it over its diagonal. For matrix $A$, the transpose $A^T$ is defined as $(A^T)_{ij} = A_{ji}$.

**Motivation:** Transposition is essential for aligning dimensions, computing covariance matrices, and solving systems of equations.

**Example:**

```python
# Transpose of matrix A (defined in the matrix multiplication example)
A_transpose = A.T
print("Transpose of A:\n", A_transpose)
```

```
Transpose of A:
 [[1 3]
 [2 4]]
```
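
As one illustration of the covariance use case mentioned in the motivation, the sketch below builds a sample covariance matrix from centred data using a transpose and a matrix product, then compares it with `np.cov`. The data matrix `X` is a made-up example with samples as rows and features as columns.

```python
import numpy as np

# Hypothetical data: 5 samples (rows), 2 features (columns)
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [5.0, 10.0]])

# Centre each feature by subtracting its mean
X_centred = X - X.mean(axis=0)

# Sample covariance: (X_c^T X_c) / (n - 1)
cov_manual = X_centred.T @ X_centred / (X.shape[0] - 1)

# NumPy's built-in version; rowvar=False means columns are the variables
cov_numpy = np.cov(X, rowvar=False)

print("Manual covariance:\n", cov_manual)
print("Agrees with np.cov:", np.allclose(cov_manual, cov_numpy))
```

The transpose is what makes the shapes line up: `X_centred.T` is 2×5 and `X_centred` is 5×2, so the product is the 2×2 matrix of feature-by-feature covariances.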


#### 3. Element-wise Multiplication

**Definition:** Multiplies corresponding elements of two arrays of the same shape. For arrays $A$ and $B$, $(A * B)_{ij} = A_{ij} \times B_{ij}$.

**Motivation:** Often used for scaling features, applying masks or gates, and combining feature maps in neural networks.

**Example:**

```python
# Element-wise multiplication of A and B
elementwise_product = A * B
print("Element-wise multiplication:\n", elementwise_product)
```

```
Element-wise multiplication:
 [[ 5 12]
 [21 32]]
```


#### 4. Dot Product (for Vectors)

**Definition:** The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as:

$$
\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i ~ b_i
$$

**Motivation:** The dot product measures how well two vectors align. It is used throughout optimisation and is the basis of common similarity measures.

**Example:**

```python
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
dot = np.dot(v1, v2)
print("Dot product of v1 and v2:", dot)
```

```
Dot product of v1 and v2: 32
```
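
One common way to turn the dot product into a similarity score is cosine similarity, which divides by the lengths of the two vectors so that only their directions matter. The sketch below is a minimal illustration; the helper name `cosine_similarity` is my own and is not a NumPy function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

similar = cosine_similarity(v1, v2)             # nearly the same direction
same_direction = cosine_similarity(v1, 2 * v1)  # scaling does not matter
orthogonal = cosine_similarity(np.array([1, 0]), np.array([0, 1]))

print("v1 vs v2:", similar)
print("v1 vs 2 * v1:", same_direction)
print("Orthogonal vectors:", orthogonal)
```

Vectors pointing the same way score close to 1, and perpendicular vectors score 0, regardless of their lengths.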


### Exercises

Work through the exercises below. For each exercise, enter your solution in the provided code block. The correct solution is hidden and will be revealed only when you click the **Show Solution** button.


#### Exercise 1: Matrix Transposition and Multiplication

**Task:** Create a 3×3 NumPy array with values from 1 to 9. Compute its transpose, then multiply the original matrix by its transpose.

```python
# Your solution here

# Create the 3x3 array
matrix = np.array([...])  # Replace [...] with your code

# Compute the transpose
matrix_T = ...

# Multiply the matrix by its transpose
result = ...

print("Original matrix:\n", matrix)
print("Transpose:\n", matrix_T)
print("Matrix multiplied by its transpose:\n", result)
```


<details>
  <summary>Show Solution</summary>
  
  ```python
  # Solution for Exercise 1
  matrix = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])
  matrix_T = matrix.T
  result = matrix @ matrix_T
  
  print("Original matrix:\n", matrix)
  print("Transpose:\n", matrix_T)
  print("Matrix multiplied by its transpose:\n", result)
  ```
</details>