
## Key Terms

#### Environment
What are you trying to solve? : Fantasy Football Drafting

#### Model: PPO

#### Agent: 
The entity that interacts with the environment using an algorithm

#### Observation (State): 
Important details of the environment that are fed to the model to make action predictions. (Vectors)

#### Action: 
What the agent does in the environment. "Pick QB Round 1" Could be an example of an action

#### Step: 
Progress in the environmeent. Can be thought of like FPS, where each "frame" is a step. 
In general a "step" in the environment will take an action for the agent to do, and return a new observation and reward for the step.

#### Action Spaces, 
Discrete - Clear classifications (Go left or go right)
Continuouse - Like regression (Go 0.02 right, or 0.5 or 0.34421; an infinite range; used in robotics)

In [7]:
# https://stable-baselines3.readthedocs.io/en/master/guide/algos.html
# pip install stable-baselines3
# pip install gymnasium
# pip install shimmy < - no necessary

In [8]:
# Reinforcement learning is less about the algorithm. You can definately teak hyperparameters in your algorithm, but in general it is going to come down to your environment, your reward mechanism, and your observation
# that you feed in to your algorithm. Of-course they take a long time to train. It will definately learn, changing the hyperparameter will not change too much, at least you can screw it up, but the defalt values is good enough.
# You can eek out maybe an extra 5% - 10% performance possbily by tweaking hyperparameters but the actual gains are going to come much from algorithm + reward mechanisms 

In [9]:
import gymnasium as gym

env = gym.make("LunarLander-v2", render_mode="human")  # 'human' for on-screen rendering

env.reset()

print("Sample Action: ", env.action_space.sample())
print("Observation space shape: ", env.observation_space.shape)
print("Sample Observation: ", env.observation_space.sample())


DeprecatedEnv: Environment version v2 for `LunarLander` is deprecated. Please use `LunarLander-v3` instead.

: 

In [6]:
# Observation is a flat vector; Action space value of some value that we can pass

In [None]:
import gymnasium as gym

env = gym.make("LunarLander-v2", render_mode="human")  # 'human' for on-screen rendering
env.reset()

for _ in range(100):
    env.render()
    action = env.action_space.sample()  # Random action
    env.step(action)

env.close()


In [None]:
import gymnasium as gym

# Create the environment
env = gym.make("LunarLander-v2", render_mode="human")
obs, info = env.reset()  # Unpack the reset output (obs, info)

for _ in range(300):
    env.render()
    action = env.action_space.sample()  # Sample a random action
    obs, reward, done, truncated, info = env.step(action)  # Update to include 'truncated'
    print(reward)
    if done or truncated:  # Check for termination or truncation
        obs, info = env.reset()  # Reset the environment

env.close()


-2.7209968941035525
0.6309611213330868
1.3893545117414317
-3.0650638771274314
0.782838231519321
1.6946636320761013
-4.443180653857257
1.07915960659912
1.9539777353993213
1.3028523586336007
-4.093596742394072
0.4335571947000869
0.08998179985292268
-4.258824790837354
0.8881893113424439
0.8720805948562429
0.8499337485937986
-0.08327735965903457
1.448532440636286
-2.260400519646157
-0.5529555116512472
1.2237947627107826
-1.6127487895232548
-0.26065390396388355
0.3878541326509435
1.4623567299024114
1.28783112204252
-0.7375719180885187
-0.9504422960529257
-0.11995339961248419
-0.22859769617255665
-2.5395326474321562
0.7424472395102566
-1.2683813145451086
-0.43731867684482495
0.8856015751954953
-1.5762734682805604
-0.06513594453774657
-1.7129236789162394
0.09437963013809395
-0.9241979842080639
-1.155345332418807
-2.156397575096963
-2.629836624257069
-0.04630437063245835
-2.504228108844461
-1.7034917449273508
0.44513023694574483
-2.353381323446274
-0.29621210206283877
-1.5767105071936942
-0.64

KeyboardInterrupt: 

In [6]:
import gymnasium as gym
from stable_baselines3 import A2C

# Create the environment
env = gym.make("LunarLander-v3", render_mode="human")
obs, info = env.reset()  # Unpack the reset output (obs, info)

model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

episodes = 10

for ep in range(episodes):
    obs = env.reset()
    done = False

    while not done:

        env.render()
        action = env.action_space.sample()  # Sample a random action
        obs, reward, done, truncated, info = env.step(action)  # Update to include 'truncated'
        # print(reward)
        if done or truncated:  # Check for termination or truncation
            obs, info = env.reset()  # Reset the environment

env.close()


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 91       |
|    ep_rew_mean        | -250     |
| time/                 |          |
|    fps                | 42       |
|    iterations         | 100      |
|    time_elapsed       | 11       |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -1.29    |
|    explained_variance | -0.0151  |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -20.3    |
|    value_loss         | 375      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 107      |
|    ep_rew_mean        | -334     |
| time/                 |          |
|    fps                | 44       |
|    iterations         | 200      |
|    time_elapsed 

In [None]:
# len_mean - how long it survives
# rew_mean - reward mean



In [None]:
import gymnasium as gym
from stable_baselines3 import PPO

# Create the environment
env = gym.make("LunarLander-v3", render_mode="human")
obs, info = env.reset()  # Unpack the reset output (obs, info)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

episodes = 10

for ep in range(episodes):
    obs = env.reset()
    done = False

    while not done:

        env.render()
        action = env.action_space.sample()  # Sample a random action
        obs, reward, done, truncated, info = env.step(action)  # Update to include 'truncated'
        # print(reward)
        if done or truncated:  # Check for termination or truncation
            obs, info = env.reset()  # Reset the environment

env.close()


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 95.8     |
|    ep_rew_mean     | -168     |
| time/              |          |
|    fps             | 44       |
|    iterations      | 1        |
|    time_elapsed    | 46       |
|    total_timesteps | 2048     |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 96         |
|    ep_rew_mean          | -165       |
| time/                   |            |
|    fps                  | 45         |
|    iterations           | 2          |
|    time_elapsed         | 90         |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00911714 |
|    clip_fraction        | 0.0413     |
|    clip_range           | 0.2        |
|    entropy_loss         | -1.38

In [3]:
import gymnasium as gym
from stable_baselines3 import PPO

# Create the environment
env = gym.make("LunarLander-v3", render_mode="human")
obs, info = env.reset()  # Unpack the reset output (obs, info)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

episodes = 10

for ep in range(episodes):
    obs = env.reset()
    done = False

    while not done:

        env.render()
        action = env.action_space.sample()  # Sample a random action
        obs, reward, done, truncated, info = env.step(action)  # Update to include 'truncated'
        # print(reward)
        if done or truncated:  # Check for termination or truncation
            obs, info = env.reset()  # Reset the environment

env.close()


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 89.5     |
|    ep_rew_mean     | -220     |
| time/              |          |
|    fps             | 46       |
|    iterations      | 1        |
|    time_elapsed    | 43       |
|    total_timesteps | 2048     |
---------------------------------


KeyboardInterrupt: 