The goal of this assignment is to **re-familiarize** you with the frameworks you will be using in the rest of this class. We will be using **Python** as our programming language and **PyTorch** as our deep learning framework.

Most of this assignment should be similar to work you have done in a past machine learning course (e.g. COS324). We encourage you to try out your own implementations for each problem, and we also encourage reading through online documentation for help. To guide you through each assignment, we have provided a loose outline of the steps you might take, but these are by no means a hard requirement as long as you solve the given question.

# Question 1.d and 1.e
Suppose instead of boxes, we now deal with a *continuous* set of possible decisions. In this scenario, our agent can choose any number $x \in \mathbb{R}$ as its action, and the reward given to the player is defined by

$$
    \mathcal{R}(x) = \max \biggr \{ 0, 1 - |x - 1| \biggr\}
$$

Suppose our agent chooses its action by sampling from a normal distribution $\mathcal{N}(0.6, 0.09)$ (to avoid confusion, the standard deviation is $\sigma=0.3$ and the variance is $\sigma^2=0.09$). Use sampling to estimate the expected return of this policy. **Report your answer to the nearest tenth. You are responsible for using enough samples that all digits of your answer are accurate.** To familiarize yourself with vectorized operations, **do not use for-loops when computing this value.**
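
As an illustration of the vectorized sampling pattern (not a solution to this question), the sketch below estimates $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$, whose exact value is 1. The test function, the sample count, and the seed are assumptions chosen purely for illustration. Note that `Normal` takes the standard deviation as its second argument, not the variance.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)  # assumed seed, only for reproducibility

# Normal(loc, scale): scale is the standard deviation, not the variance.
dist = Normal(loc=0.0, scale=1.0)

# Draw all samples in a single call; no Python loop is needed.
num_samples = 1_000_000  # assumed sample count
x = dist.sample((num_samples,))

# Apply the function elementwise, then average for the Monte Carlo estimate.
values = x ** 2
estimate = values.mean()

# The standard error of the estimate shrinks like 1 / sqrt(num_samples).
std_error = values.std() / num_samples ** 0.5
print("Estimate:", estimate.item(), "+/-", std_error.item())
```

The standard error gives a rough sense of how many digits of the estimate can be trusted for a given sample count.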

In [None]:
import torch
import torch.nn.functional as F
from torch.distributions import Normal

import matplotlib.pyplot as plt

# Read through the documentation: https://pytorch.org/docs/stable/distributions.html

def reward(x: torch.Tensor) -> torch.Tensor:
  """
    Compute the reward function elementwise over a tensor input.
  """
  return F.relu(1 - torch.abs(x-1))

In [None]:
### Your code here.

r = reward(...)
print("Expected reward:", r.mean())

### Question 1.e
Plot the reward distribution under this policy using matplotlib.pyplot.hist() or another histogram plotting function. Make sure to label your axes for this problem (x-axis is reward, y-axis is probability density).
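
For the "probability density" axis, it may help to see what density normalization does. The sketch below uses placeholder uniform data (an assumption, not the reward samples) and `np.histogram` with `density=True`, which rescales the bin counts so that the area under the histogram is 1. `plt.hist(values, bins=50, density=True)` applies the same normalization, and `plt.xlabel` / `plt.ylabel` set the axis labels.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
values = rng.uniform(0.0, 2.0, size=100_000)  # placeholder data, not the reward samples

# density=True rescales counts so that sum(density * bin_width) equals 1.
density, edges = np.histogram(values, bins=50, density=True)
widths = np.diff(edges)
area = float(np.sum(density * widths))
print("Area under histogram:", area)
```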

In [None]:
### Your code here.

plt.hist(...)
plt.show()

# Question 3
In this problem, you will code a barebones version of gradient descent using numpy, as well as an analogous version using torch. The goal is to re-familiarize you with the inner workings of gradient descent, as well as the standard formulations we use in practice.

### Question 3, Part (a): Implementing gradient descent for a toy problem from scratch without the use of torch.

In this assignment, you will implement gradient descent from scratch to solve a regression problem. Suppose we have an extremely simple linear model. Let $N$ be the number of data points in our dataset. Given a dataset $X \in \mathbb{R}^{N \times 10}$ and corresponding target values $y \in \mathbb{R}^{N}$, we want to find the weights $W \in \mathbb{R}^{10}$ that best fit the data $y = XW$ according to MSE loss using gradient descent. As a reminder, the MSE loss is

$$ L(W) = \frac{1}{2N} (XW - y)^T (XW - y) $$

**Remark.** There is a closed-form solution to this problem which I'm sure many of you know, but the purpose of this exercise is to give you a quick reminder of how gradient descent is implemented. The gradients are purposely quite simple to compute, so for this assignment please use gradient descent.
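
Differentiating the loss above gives $\nabla_W L = \frac{1}{N} X^T (XW - y)$. One way to sanity-check a hand-derived gradient is to compare it against central finite differences on a small problem. The sketch below does this on toy data; the names `X_toy`, `y_toy`, `W_toy`, the sizes, and the seed are all assumptions for illustration, and the `for` loop is used only for the check, not for the descent itself.

```python
import numpy as np

rng = np.random.default_rng(0)     # assumed seed and toy sizes
X_toy = rng.normal(size=(20, 3))
y_toy = rng.normal(size=(20, 1))   # column vector, so X @ W - y stays (N, 1)
W_toy = rng.normal(size=(3, 1))
N = X_toy.shape[0]

def mse_loss(W):
    residual = X_toy @ W - y_toy
    return np.sum(residual ** 2) / (2 * N)

def analytic_grad(W):
    # Gradient of the loss above: (1/N) X^T (XW - y)
    return X_toy.T @ (X_toy @ W - y_toy) / N

def numeric_grad(W, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        step = np.zeros_like(W)
        step[i] = eps
        grad[i] = (mse_loss(W + step) - mse_loss(W - step)) / (2 * eps)
    return grad

max_diff = np.max(np.abs(analytic_grad(W_toy) - numeric_grad(W_toy)))
print("Max difference between analytic and numeric gradients:", max_diff)

# Shape pitfall: (N, 1) minus (N,) broadcasts to (N, N), not (N, 1).
pitfall_shape = (np.zeros((4, 1)) - np.zeros(4)).shape
print("Shape of (4, 1) - (4,):", pitfall_shape)
```

The last two lines matter for the provided starter code: the weights there have shape `(10, 1)` while `y` has shape `(N,)`, so reshaping one of them before subtracting avoids a silent broadcast.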



In [None]:
import numpy as np

In [None]:
# Import the dataset
rng = np.random.RandomState(189289213)
X = 10 * rng.rand(1000, 10) # feature matrix
y = np.dot(X, [1,2,3,4,5,6,7,8,9,10]) + np.random.normal(0, 0.01) # target vector

def init_weights (in_1:int, in_2:int) -> np.ndarray:
  return np.random.rand(in_1,in_2)

weights = init_weights(X.shape[1], 1)

# Helper functions to get you started. Feel free to use or not use them.
def gradient_descent(X, y, weights: np.ndarray, eta: float, iterations: int):
  for i in range(iterations):
    ### Your code here.
    pass

def grad():
  ### Your code here.
  pass

print("Solved weights:", weights)

### Question 3, Part (b)
Now rewrite the solution above using the standard torch optimization loop. Define a loss (MSE), call `loss.backward()`, and use `weights.grad` to perform the gradient descent update.
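
The mechanics of a manual torch update can be sketched on a one-parameter toy objective, $(w - 3)^2$, which is an assumption chosen for illustration and not the regression problem above. The three pieces to notice are `loss.backward()` to populate `w.grad`, `torch.no_grad()` around the in-place update, and zeroing the gradient afterwards because gradients accumulate across calls to `backward()`.

```python
import torch

w = torch.tensor([0.0], requires_grad=True)  # toy parameter
eta = 0.1  # assumed step size

for _ in range(100):
    loss = ((w - 3.0) ** 2).sum()  # toy objective with its minimum at w = 3
    loss.backward()                # populates w.grad
    with torch.no_grad():          # keep the update out of the autograd graph
        w -= eta * w.grad
    w.grad.zero_()                 # gradients accumulate unless reset

print("Final w:", w.item())
```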

In [None]:
X_t = torch.from_numpy(X).float()
y_t = torch.from_numpy(y).float()

weights = torch.randn((X.shape[1], 1), requires_grad=True)

def gradient_descent_torch(X, y, weights: torch.Tensor, eta: float, iterations: int) -> None:
  for i in range(iterations):
    pass
    # Your code here

gradient_descent_torch(X_t, y_t, weights, 0.001, 500)
print(weights)

### Optional Question for the Interested!
To familiarize yourself with PyTorch, we are going to build a simple model and implement the standard training loop. The goal of this assignment is to build a really dumb agent that can "learn" (maybe not successfully) how to play a game.

Even if you already know how, **do not implement algorithms such as Deep Q Networks or policy gradients**. We know that they already (provably) work! The purpose of this exercise is to naively apply deep learning to an RL-style problem and see what happens.

In [None]:
import gym

In [None]:
# From https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_01_ai_gym.ipynb#scrollTo=XYwy9cjlJjEH

# Helper functions
def env_info(name:str):
    env = gym.make(name, new_step_api=True)
    spec = gym.spec(name)
    print(f"Action Space: {env.action_space}")
    print(f"Observation Space: {env.observation_space}")
    print(f"Max Episode Steps: {spec.max_episode_steps}")
    print(f"Nondeterministic: {spec.nondeterministic}")
    print(f"Reward Range: {env.reward_range}")
    print(f"Reward Threshold: {spec.reward_threshold}")

env_info("MountainCar-v0")

Action Space: Discrete(3)
Observation Space: Box([-1.2  -0.07], [0.6  0.07], (2,), float32)
Max Episode Steps: 200
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: -110.0




The purpose of the code below is to generate a video of the environment for you to watch. Now, using the standard torch optimization pipeline, try to design a simple deep learning model that plays a bunch of games of MountainCar and tries to maximize the reward.

In [None]:
# HIDE OUTPUT
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
## From https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_01_ai_gym.ipynb#scrollTo=XYwy9cjlJjEH
## This is for rendering in Colab

import gym
from gym.wrappers.record_video import RecordVideo
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of gym environment
and displaying it.
To enable video, just do "env = wrap_env(env)"
"""


def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")


def wrap_env(env):
    env = RecordVideo(env, './video',  episode_trigger = lambda episode_number: True)
    # env = Monitor(env, './video', force=True)
    return env

For this problem, **you can design a model that uses any arbitrary loss function or any arbitrary input.**

The minimal requirements for this problem are to train an agent on some kind of loss function using a deep learning model defined by the standard torch pipeline (e.g. define an optimizer, loss.backward(), optimizer.step()). You will most likely find that your algorithm does not work.
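
For reference, the standard torch pipeline mentioned above (define an optimizer, call `loss.backward()`, call `optimizer.step()`) can be sketched on a toy supervised problem. Everything here is an assumption for illustration: the data, the network size, the choice of Adam, and the learning rate. It does not involve MountainCar, and it does not suggest which loss or input to use for the agent.

```python
import torch

torch.manual_seed(0)  # assumed seed

# Hypothetical toy data: 2-D inputs whose target is the sum of the inputs.
inputs = torch.randn(256, 2)
targets = inputs.sum(dim=1, keepdim=True)

model = torch.nn.Sequential(
    torch.nn.Linear(2, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

initial_loss = loss_fn(model(inputs), targets).item()

for _ in range(200):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)  # forward pass
    loss.backward()                         # backward pass
    optimizer.step()                        # parameter update

final_loss = loss_fn(model(inputs), targets).item()
print("Loss before:", initial_loss, "after:", final_loss)
```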

**Short Answer Question**:
Think about why a standard function mapping from state to action does not work well in this problem. What would you do to resolve this?

In [None]:
### Training

train_env = gym.make("MountainCar-v0")
TRAIN_ITERATIONS = 10000
# model = torch.nn.Sequential(...)
# optimizer = ...

for i in range(TRAIN_ITERATIONS):
  # Your code here.
  pass

### Evaluation

env = wrap_env(gym.make("MountainCar-v0", render_mode="rgb_array"))

state = env.reset()

while True:
  env.render()
  # Your model will select actions here instead of random sampling, e.g. action = model(state)
  action = env.action_space.sample()
  state, reward, done, info = env.step(action)

  if done:
    break

env.close()
show_video()

